Closed Bug 1168148 Opened 9 years ago Closed 8 years ago

Create pulse listener that listens to Treeherder's job cancellation and retrigger actions to act upon them with mozci

Categories: Testing :: General, defect
Severity: normal
Priority: Not set
Tracking: Not tracked
Status: RESOLVED INVALID
People: Reporter: armenzg; Assigned: adusca
References: Blocks 1 open bug
Attachments: 2 files

We would like to create a service that deals with self-serve/buildapi scheduling on behalf of TH.
This service will listen to TH actions and trigger them through mozci.
The initial implementation will only deal with self-serve/buildapi until we support TC. NOTE: once we aim to tackle TC support, we will need to talk with the TC team, since they already have their own implementation.

At first we will design it to not trigger anything (--dry-run).
Once the treeherder team is satisfied we could look into not hitting self-serve directly from TH and trust that this listener will trigger the required actions.

adusca started today and will be working on this. She has experience with mozci as well as with Pulse.

We intend to make this tool as generic as possible so we can add more listeners and new actions depending on what we want to do.

We're hoping to base the initial implementation on trigger-bot [1]
and deploy it with Heroku.

In the future, this setup will also allow for more sophisticated trigger requests to be made by TH (e.g. backfilling, triggering arbitrary builders).

TaskCluster and TH currently follow this model. [2]

jonasfj has a more enhanced version described in [3][4]
I think we can start with a simpler setup and discuss in the future how far we want to take it.

[1] https://github.com/chmanchester/trigger-bot
[2]
armenzg_brb: so I haven't dealt with this piece directly, but I think that TH publishes a message to an exchange when an action occurs (https://github.com/mozilla/treeherder/blob/70330c5842a51d06a81a2f1f60cc1b1988c0ff02/treeherder/model/exchanges.py#L35) that mozilla-taskcluster listens for on that exchange for routes like "taskcluster.#" ....once it
<garndt> schedules a new task, it publishes to a "retrigger-created" exchange
<garndt> armenzg_brb: at least that's what I saw quickly going through the code
<garndt> armenzg_brb: so I don't think there is something directly notifying TH once the task has been retriggered

Other links:
https://github.com/mozilla/treeherder/blob/adaf5d297658afb7e06fce03d98d94c0043e4f4e/treeherder/model/tasks.py#L100
https://github.com/taskcluster/mozilla-taskcluster/blob/9f31c93f55f3f9060fe8421d8c1b9e83affe5313/docs/treeherder.md#L25

[3] https://bugzilla.mozilla.org/show_bug.cgi?id=1149789
[4] https://groups.google.com/forum/#!topic/mozilla.tools/k17nQ2NF1dI
Also notice that we would like to be able to test changes against treeherder.allizom.org (staging system) before making them live on treeherder.mozilla.org (production system).
Once Heroku is up and running, we can set up logging with:
https://papertrailapp.com
I'm developing this in a new repo: https://github.com/adusca/pulse_actions

Current status:

- worker.py listens to exchange/treeherder/v1/job-actions in an infinite loop. It calls make_retrigger_request when it sees a retrigger action and make_cancel_request when it sees a cancel action. Currently everything is run with dry_run=True, so instead of making requests the script just logs the requests it would have made.

- I'm dealing with credentials using a plaintext, manually-written file located at ~/.mozilla/mozci/pulse_credentials.
Question: Should I move make_cancel_request and make_retrigger_request to mozci/sources/buildapi ?

Proposed next step: deploy the script to Heroku with dry_run=True (after possible revisions). Let it run for a couple of days and see in the logs what would've happened.
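The dry-run dispatch described above can be sketched roughly like this. Function and message-field names ('action', 'project', 'job_id') are assumptions for illustration, not the actual pulse_actions code:

```python
import logging

logging.basicConfig(level=logging.INFO)
LOG = logging.getLogger("pulse_actions")


def make_retrigger_request(repo_name, request_id, dry_run=True):
    # Hypothetical stand-in for the buildapi call; in dry-run mode we only log.
    if dry_run:
        LOG.info("Would retrigger request %s on %s", request_id, repo_name)
        return None
    raise NotImplementedError("real buildapi call goes here")


def make_cancel_request(repo_name, request_id, dry_run=True):
    if dry_run:
        LOG.info("Would cancel request %s on %s", request_id, repo_name)
        return None
    raise NotImplementedError("real buildapi call goes here")


def handle_job_action(message, dry_run=True):
    # Dispatch on the action type carried by the pulse message; the real
    # payload of exchange/treeherder/v1/job-actions may use different keys.
    action = message["action"]
    if action == "retrigger":
        return make_retrigger_request(message["project"], message["job_id"], dry_run)
    if action == "cancel":
        return make_cancel_request(message["project"], message["job_id"], dry_run)
    LOG.warning("Ignoring unknown action %r", action)
    return None
```

In dry-run mode both handlers return without side effects, which is what makes the "let it run for a couple of days and read the logs" plan safe.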
Attachment #8610303 - Flags: review?(armenzg)
Output: https://pastebin.mozilla.org/8834632

Nobody cancelled a job while I was testing the script.

There are a lot of identical jobs. Maybe we could add a count parameter to exchange/treeherder/v1/job-actions so when a user retriggers a job N times it sends one message with 'count = N' instead of N messages? That way we could do just one retrigger request with count = N. The function make_retrigger_request already has this functionality.
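A sketch of that coalescing idea, assuming each message carries 'job_id' and 'action' fields (the real payload shape may differ): collapse N identical retrigger messages into one request with count = N.

```python
from collections import Counter


def coalesce_retriggers(messages):
    """Collapse N identical retrigger messages into one entry with count=N."""
    counts = Counter(m["job_id"] for m in messages if m["action"] == "retrigger")
    return [{"job_id": job_id, "action": "retrigger", "count": n}
            for job_id, n in sorted(counts.items())]
```

The downstream request would then call make_retrigger_request once with the count parameter instead of N times.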
Comment on attachment 8610303 [details]
Script to listen to pulse and take actions

I'm glad to see you made so much progress in such a short time!
The code is looking well organized and clean. I like it.

I'm going to give f+ instead of r+ since for me, review requests are about when the code is believed to be ready to land.

I would like to see the code be more modularized:
the Treeherder functionality should go into its own module, and so on.
I would only keep the worker logic in worker.py.

With regards to TH, there is a TH client that vaibhav and wlach have been working on. Would you be able to see if we can re-use that?

I also would like us to code towards having a config file that specifies:
* what to listen to
* what to trigger

For instance:
* we define a consumer (e.g. TreeherderJobActionsConsumer)
* we define the needed pulse arguments (topic - 'buildbot.#.#')
* If we have various tasks listening to the same consumer with the same arguments we can iterate over the list of actions associated to it
* The actions can be a call to a module.function (I don't know exactly what it would look like)

We can have a directory called actions or tasks where new tasks or actions to process pulse messages are defined.

This model will make it easy to add new stuff and be more generic.
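One way to sketch that model: actions are listed in the config as "module.function" strings and resolved into callables at startup. The config shape and names here are illustrative only (json.dumps stands in for a real handler like actions.retrigger.on_event):

```python
import importlib


def resolve_action(dotted_name):
    """Turn a 'module.function' string from the config into a callable."""
    module_name, func_name = dotted_name.rsplit(".", 1)
    module = importlib.import_module(module_name)
    return getattr(module, func_name)


# Hypothetical config entry: one consumer, its pulse arguments, and the
# actions to run for each message it receives.
CONFIG = {
    "TreeherderJobActionsConsumer": {
        "topic": "buildbot.#.#",
        "actions": ["json.dumps"],  # stand-in for e.g. "actions.retrigger.on_event"
    },
}


def run_actions(consumer_name, message):
    """Run every configured action for a consumer against one message."""
    entry = CONFIG[consumer_name]
    return [resolve_action(name)(message) for name in entry["actions"]]
```

Adding a new action then only means dropping a module into an actions/ directory and adding its dotted name to the config.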

> Question: Should I move make_cancel_request and make_retrigger_request to mozci/sources/buildapi ?
Yes, please.
I would like to see mozci being called with the dry_run argument.

What is label for? What is duration for? Where can I read a bit more about it?

You can try cancelling a job on a try push or maybe on http://treeherder.allizom.org.

With regards to multiple retrigger requests, this is due to the way that Treeherder's UI works. We could queue up messages and then evaluate whether we can trigger a few in one shot.
What are we trying to optimize? Make less requests to buildapi?
Attachment #8610303 - Flags: review?(armenzg) → feedback+
<lightsofapollo> Heroku is a hosting type thing
<lightsofapollo> CircleCI is a CI system similar to travis that integrates nicely
<lightsofapollo> armenzg: correct we handle all interactions to and from TH in there

Taking notes on how TC's setup works.
Alice: according to lightsofapollo as a long term solution we will need to create tasks on TC to call mozci for us.
We can start with *only* using Heroku, however, lightsofapollo warns that we will soon hit issues.

For now let's put these thoughts on the back burner and get our prototype working.

jonasfj recommended that we use node.js instead of python because it is better for async work, however, I would like us to focus on the technologies we currently know before we decide if we are better off moving to something that most of us are not familiar with:
https://etherpad.mozilla.org/jonasfj-armenzg-pulse-hook-service

Once we have a POC, let's see what we change.

#######################################################
lightsofapollo: The line which I am sure you got before... is you want to run the actions on the events in TC
you may end up sad if you run the actions on heroku directly
heroku is reallllly good at handling non-blocking IO stuff

armenzg: lightsofapollo, so Heroku simply dispatches it to TC
lightsofapollo: we would gladly use/contribute to such a system ... I think at least three groups would like to do something along those lines

<armenzg> lightsofapollo, is there a blog post with basic steps to trigger tasks on TC?
<lightsofapollo> jonasfj: ^ probably task inspector is the most basic thing
lightsofapollo> https://tools.taskcluster.net/task-creator/

<lightsofapollo> armenzg: there are similar credentials (perma credentials) which we can issue you for long term stuff
<jonasfj> with correct scopes..
<lightsofapollo> the temp ones will work to start
Current status:

- worker.py [1] now receives the exchange and topic as command-line arguments. It then uses HANDLERS_BY_EXCHANGE, a dictionary defined in config.py [2], to decide what function it will use to handle incoming messages. Currently only buildbot topics (e.g. buildbot.#.# or buildbot.try.retrigger) in exchange/treeherder/v1/job-actions are supported.

- The functions to deal with TH's job-actions are defined in the module treeherderactions.py [3]

- After talking with mdoglio on IRC, I refactored get_request_id_from_job_id to make only one api call [4]

- output: https://pastebin.mozilla.org/8834740

Note: pulse_actions currently depends on my branch retrigger-cancel in mozilla_ci_tools, because I moved make_retrigger_request and make_cancel_request to mozci/sources/buildapi

[1] https://github.com/adusca/pulse_actions/blob/master/worker.py
[2] https://github.com/adusca/pulse_actions/blob/master/config.py
[3] https://github.com/adusca/pulse_actions/blob/master/treeherderactions.py
[4] https://github.com/adusca/pulse_actions/blob/master/treeherderactions.py#L14-L20
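For reference, topic patterns like 'buildbot.#.#' follow AMQP wildcard rules: '*' matches exactly one dot-separated word and '#' matches zero or more. A small matcher along those lines (a sketch, not the actual pulse_actions or kombu code):

```python
import re


def topic_matches(pattern, routing_key):
    """Match an AMQP-style topic pattern against a concrete routing key.

    '*' matches exactly one dot-separated word; '#' matches zero or more
    words (absorbing the neighbouring dot, as AMQP specifies).
    """
    words = [r"[^.]+" if w == "*" else re.escape(w) for w in pattern.split(".")]
    regex = r"\.".join(words)
    # re.escape turns '#' into '\#'; rewrite it so '#' can match zero words.
    regex = regex.replace(r"\.\#", r"(?:\.[^.]+)*")
    regex = regex.replace(r"\#\.", r"(?:[^.]+\.)*")
    regex = regex.replace(r"\#", r".*")
    return re.fullmatch(regex, routing_key) is not None
```

So 'buildbot.#.#' accepts 'buildbot.try.retrigger' while 'buildbot.*' does not, since '*' cannot span more than one word.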
This is looking great. Thanks Alice!

Could you please have a blog post laying down what we're trying to achieve for Monday?
I would like it to be read by a-team members in advance of our Monday meeting and talk a bit about it with everyone so they're aware of what you're building.

In the future, could you have branches and do PR requests to your own repo?
By doing this, I can better see the changes you're making and easily make comments where needed.

You can then merge your changes and create new PRs for any feedback I give you.

In general, we should have some way of reading the config file and starting up one pulse listener per entry in the config, instead of starting manually with command-line options.
That way when we push new code to Heroku it reloads the config and adds new listeners automatically.
We should probably create each listener into its own parallel process.

Could you also add some documentation on how to set this up? Or info on how to add new stuff?
Even if very brief, it helps indirectly to review mentally what the project is trying to accomplish and if things are modularized properly.

In https://github.com/adusca/pulse_actions/blob/master/worker.py#L64 we use PulseConsumer which is generic, what do you think we could do to help people add their own consumers?
For instance, chmanchester uses BuildConsumer:
https://github.com/chmanchester/trigger-bot/blob/master/triggerbot/triggerbot_pulse.py#L147

I think we could define the class either in the config OR in the same file where the event handler lives. I really don't know.

config.py
#########
Add some info on what the config is supposed to define.

I think this format will be better long term:
HANDLERS_BY_EXCHANGE = {
    "exchange/treeherder/v1/job-actions": {
        "topics": {
            "buildbot": treeherderactions.on_buildbot_event
        }
    }
}

What do you think?

worker.py
#########
Something I've learned from a video by Raymond about being explicit when calling Python functions to improve understanding:
Could you add each keyword for each parameter in here?
run_pulse(options.exchange, handler_function, options.topic, dry_run=True)

treeherderactions.py
####################
Move get_request_id_from_job_id() to utils/treeherder.py
Move each other function to handlers/treeherder_buildbot.py and handlers/treeherder_taskcluster.py

What do you think?
About PulseConsumer/ BuildConsumer:

BuildConsumer(**pulse_args) is the same thing as PulseConsumer(exchange='exchange/build/', **pulse_args).

In mozillapulse.consumers, there are a lot of different classes of consumers for different exchanges:
https://hg.mozilla.org/automation/mozillapulse/file/f5dc2a54d170/mozillapulse/consumers.py#l207

Almost all of them have the same format; the only thing that changes between them is the name and the exchange.

Since run_pulse() doesn't know beforehand what exchange it will listen to, I created a class that receives the exchange as a parameter.
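The pattern Alice describes — specialized consumer subclasses that only differ in name and exchange, plus a generic one that takes the exchange as a parameter — looks roughly like this (a simplified sketch, not the actual mozillapulse code):

```python
class PulseConsumer(object):
    """Generic consumer: the exchange is supplied by the caller."""

    def __init__(self, exchange, **kwargs):
        self.exchange = exchange
        self.kwargs = kwargs  # topic, credentials, callback, etc.


class BuildConsumer(PulseConsumer):
    """Specialized consumer: the exchange is fixed at class level."""

    def __init__(self, **kwargs):
        super(BuildConsumer, self).__init__(exchange="exchange/build/", **kwargs)
```

With this shape, BuildConsumer(**pulse_args) and PulseConsumer(exchange='exchange/build/', **pulse_args) are indeed interchangeable, which is why the generic class is enough when the exchange is only known at runtime.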
Current status:

I made the modifications you suggested. Most importantly, instead of command-line arguments, worker.py now reads the exchange and topic from a file called run_time_config.json. There is still only support for running one pulse listener at a time.

The project is now on pypi: https://pypi.python.org/pypi/pulse-actions/0.1 and it should be pip installable after the next mozci release.
Pulse Actions is now deployed to Heroku! Its logs are being tracked by papertrail [1].

I wrote the blog post mentioned in comment 9: http://explique.me/Pulse/

Today I worked on improving the documentation [2]. I even added a 'how to write a hello world service' step-by-step guide [3].

[1] https://addons-sso.heroku.com/apps/pulse-actions/addons/papertrail?q=worker
[2] https://github.com/adusca/pulse_actions/
[3] https://github.com/adusca/pulse_actions/blob/master/hello_world.md
Logs for the first day running on Heroku.
Heroku cycles its dynos every day. That means that pulse_actions is not available for a couple of seconds.

May 29 08:09:47 pulse-actions heroku/worker.1: Cycling 
May 29 08:09:47 pulse-actions heroku/worker.1: State changed from up to starting 
May 29 08:09:49 pulse-actions heroku/worker.1: Stopping all processes with SIGTERM 
May 29 08:09:51 pulse-actions heroku/worker.1: Process exited with status 143 
May 29 08:09:51 pulse-actions heroku/worker.1: Starting process with command `python pulse_actions/worker.py` 
May 29 08:09:52 pulse-actions heroku/worker.1: State changed from starting to up
Is this every day at the same time? I think that is less than 0.01% of the whole day; is that correct?
(5/86400)*100=0.0058%
Is this 99.9942% availability?
Can't we just make the queues durable?
If it works, sure!
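Spelling out the arithmetic for the cycling downtime (assuming roughly 5 seconds of downtime per daily dyno cycle, per the log timestamps above):

```python
# Roughly 5 seconds of downtime per daily dyno cycle.
SECONDS_PER_DAY = 24 * 60 * 60  # 86400

downtime_pct = 5 / SECONDS_PER_DAY * 100
availability_pct = 100 - downtime_pct

print(round(downtime_pct, 4))      # ~0.0058 (percent of the day)
print(round(availability_pct, 4))  # ~99.9942 (percent availability)
```

So the cycling costs about 0.0058% of the day, i.e. roughly 99.994% availability, before considering durable queues.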
Todo:
Allow grabbing credentials from encrypted env variables:
https://github.com/armenzg/mozilla_ci_tools/issues/237
We would like to test using pulse_actions for retrigger/cancel requests in the ash branch.

For this, treeherder would have to keep sending pulse messages to exchange/treeherder/v1/job-actions but stop doing the actual request to buildapi just in the ash branch. Do you think it's possible?
Flags: needinfo?(emorley)
Yeah you'll just need a conditional on $scope.repoName, wrapping:
https://github.com/mozilla/treeherder/blob/9da6ddbf4151436f20a763d9543b86409214d8c3/ui/plugins/controller.js#L274-L277
...which will stop just the buildapi retrigger, but not the treeherder API POST that then results in the pulse message.

Please file a Treeherder bug for this, marking it blocking this one :-)
Flags: needinfo?(emorley)
Depends on: 1170839
Blocks: 1141262
Depends on: 1171575
Hi Ed, I see that treeherder shows a message when retriggering or cancelling goes well.

How can mozci let TH know that everything went well?
We're working on sending pulse messages upon taking action; would that work for TH?
(In reply to Armen Zambrano G. (:armenzg - Toronto) from comment #21)
> Hi Ed, I see that treeherder shows a message when retriggering or cancelling
> goes well.
> 
> How can mozci let TH know that everything went well?
> We're working on sending pulse messages upon taking action; would that work
> for TH?

I've tested on Ash and we get a "successful retrigger" or "successful cancel" kind of message on TH.
I assume TH *only* reports that it managed to send a pulse message.
Are we good then?
Flags: needinfo?(emorley)
philor raised some really good points on #releng [1]. Previously, I assumed TH only allowed retrigger/cancel requests from logged in users who had some sort of Mozilla link (a vouched mozillian profile, or something like that).

I tried retriggering on ash with a persona account unrelated to Mozilla and the request was sent. I turned pulse actions off after that. A total of 7 jobs were retriggered using pulse_actions, most of which by me, so no damage.

[1] http://logs.glob.uno/?c=mozilla%23releng#c159615
Depends on: 1032163
The set of people who should be allowed to retrigger and cancel is, well, exactly the set who can now: anyone with whatever the Level 1 commit access LDAP bit is called, scm_1 or something like that. Getting access to push to Try means you should also be able to retrigger your broken jobs, and to cancel the rest when it becomes clear your push is broken. So you can only depend on bug 1032163 if it's going to be fixed by having some sort of a regular scheduled LDAP import, and also if you can count on everyone using the same email address for LDAP and for Persona and good luck with that I'm afraid.
(In reply to Alice Scarpa [:adusca] from comment #23)
> philor raised some really good points on #releng [1]. Previously, I assumed
> TH only allowed retrigger/cancel requests from logged in users who had some
> sort of Mozilla link (a vouched mozillian profile, or something like that).
> 
> I tried retriggering on ash with a persona account unrelated to Mozilla and
> the request was sent. I turned pulse actions off after that. A total of 7
> jobs were retriggered using pulse_actions, most of which by me, so no damage.

This is currently expected. When taskcluster people added the submit to pulse feature to treeherder, they were fine with no additional auth. Coming up with a solution for this is mentioned in bug 1077053 comment 0.

The problem is that:
1) It's insufficient to just depend on, say, "@mozilla.com", since that would exclude contributors plus Mozilla staff who don't use their moco email
2) Syncing with LDAP is presumably going to be a pain - both from a network flows/auth point of view (particularly if we move to Heroku) and a "people using different emails for LDAP vs Persona" point of view.
3) LDAP access is becoming increasingly irrelevant - given gaia is on github, and people can open PRs to do "try" runs of gaia changes

I think we should:
* do what bug 1032163 comment 0 suggests and use mozillians.org vouched status instead - at least we'd be more likely to have the same email address to cross-reference with the treeherder persona login
* make sure mozci has a log of who performed what actions (and ideally also pass the username onto buildapi's "requested by" or similar)
* [already possible] make sure we have a way of disabling accounts in treeherder, should a mozillians.org vouched user start abusing the system or have their account compromised
Flags: needinfo?(emorley)
While bug 1032163 gets solved, we can in the short term put a limit per user (a higher one for mozilla.com accounts).
I'm thinking 100/hour per user and 2000/hour per mozilla.com address.
We can whitelist sheriffs if wanted.

Once bug 1032163 is fixed we can keep the limit on non-vouched users or block them completely but TH should tell them somehow that they can't or they reached a limit (We will have a pulse exchange for mozci/pulse_actions if need be).

We also need to check that when a request is made on buildapi, the user shows up in here:
https://secure.pub.build.mozilla.org/buildapi/self-serve/jobs
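A sliding-window limiter of the kind proposed here could look like this (a sketch; the class name is made up and the limits come from the comment above; the clock is injectable so the behaviour can be tested without waiting an hour):

```python
import time
from collections import defaultdict, deque


class RateLimiter(object):
    """Per-user sliding-window limit, with a higher cap for mozilla.com."""

    def __init__(self, default_limit=100, moco_limit=2000, window=3600,
                 clock=time.time):
        self.default_limit = default_limit
        self.moco_limit = moco_limit
        self.window = window          # window length in seconds
        self.clock = clock
        self.requests = defaultdict(deque)  # email -> timestamps of requests

    def allow(self, email):
        """Record one request for `email`; return False if over the limit."""
        limit = (self.moco_limit if email.endswith("@mozilla.com")
                 else self.default_limit)
        now = self.clock()
        q = self.requests[email]
        # Drop timestamps that have fallen out of the window.
        while q and q[0] <= now - self.window:
            q.popleft()
        if len(q) >= limit:
            return False
        q.append(now)
        return True
```

A whitelist for sheriffs would just be an email set checked before the limit lookup.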
No, that's not acceptable. You are not fixing a horrible break in productivity with this, you are solving a few seconds of annoyance once per browser session. It is not acceptable to cripple non-employees for that, nor is it acceptable to jeopardize the entire project by allowing anyone with a handful of throwaway email addresses to tie up our entire infrastructure. There is no per-hour number you can choose which will be acceptable, because every number will be either larger than some entire slave pool or smaller than the number we want some contributor working on an intermittent to run.
When TH sends a job action message, it contains an email address that we can trust is controlled by the user who made the request, because of Persona authentication.

Pulse actions can use Mozillians' API [1] to check if that email belongs to a vouched user. That way we don't have to impose limits and throwaway accounts cannot be used to trigger anything.

I believe this would work for now as a work-around, but there are still problems with that approach:

- Users that have a LDAP account but don't have a vouched Mozillian profile will lose retriggering/cancel abilities on TH with this.

- TH will show a message saying 'retrigger request sent' even when pulse_actions blocks the user for not being vouched

We can test this approach on ash to improve on the design.

[1] http://mozillians.readthedocs.org/en/latest/api/apiv2/api-users.html
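A sketch of such a check against the Mozillians API v2. The endpoint shape and the is_vouched field follow the docs linked above, but treat them as assumptions; the fetcher is injectable so the parsing logic can be tested offline, and in production it would wrap requests.get() with an api-key:

```python
def is_vouched(email, fetch):
    """Return True if `email` belongs to a vouched Mozillian.

    `fetch` takes a URL and returns the decoded JSON response. The URL shape
    below is a hypothetical rendering of the apiv2 users endpoint.
    """
    data = fetch("https://mozillians.org/api/v2/users/?email=%s" % email)
    results = data.get("results", [])
    return bool(results) and results[0].get("is_vouched", False)
```

pulse_actions would call this before acting on a retrigger/cancel message, dropping requests from unvouched or unknown addresses.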
(In reply to Phil Ringnalda (:philor) from comment #27)
> No, that's not acceptable. You are not fixing a horrible break in
> productivity with this, you are solving a few seconds of annoyance once per
> browser session. It is not acceptable to cripple non-employees for that, nor
> is it acceptable to jeopardize the entire project by allowing anyone with
> handful of throwaway email addresses to tie up our entire infrastructure.
> There is no per-hour number you can choose which will be acceptable, because
> every number will be either larger than some entire slave pool or smaller
> than the number we want some contributor working on an intermittent to run.

You're very right. The system as-is already allows capable users to abuse the system.
We're crippling those users due to our inability to stop a new set of users.

It will be good to have adusca's experiment as a data point but as she points out we won't be able to support LDAP accounts that are not vouched. Hence, not making this as part of the solution.

I say we move on to tasks that only sheriffs could trigger at first, like backfilling or running missing jobs on a revision.
I want to put stress on the system.

philor, does this make more sense to you?
We decided not to use is_vouched status for pulse_actions, but since I had written a small function I'm pasting it here for future reference: https://pastebin.mozilla.org/8837409
For now, the last two things to do before closing this:
* Make sure that buildapi shows the user we're triggering on behalf of
* Add pulse publishing
** This could be used by other tools to know what pulse_actions is up to

We can file another bug to enable this after bug 1032163 is completed.

adusca, would this work for you? Am I missing anything?

I don't want to keep Ash booked longer than we need to; let's at least wrap up the work here so this doesn't feel like it's dragging.
(In reply to Armen Zambrano G. (:armenzg - Toronto) from comment #29)
> I say we move on to tasks that only sheriffs could trigger at first like
> backfilling or run missing jobs on a revision.

Strictly speaking, the set of "sheriffs" who need to do that are "everyone with level 3 commit access" both for twigs and because tomorrow IT will have another of their misnamed "tree closing windows" which is the odd phrase they use to mean "tree left open but we will break as many jobs as we want during a time when we know there will be no sheriff and expect that developers who have been trained to push and never look will instead look and know how to tell their bustage from our bustage and retrigger," but, sure. There you're talking about things that nobody had before you came along, so you are free to restrict them however you want.
Also, as another option, we can hold off on TH support and wait for bug 1032163, since all of these can be run from the command line :)
We could improve the current script abilities by pushing on the mach front.
Blocks: 1178522
Depends on: 1273096
No longer depends on: 1032163
Specifically, cancels and re-triggers are dealt with by buildapi and mozilla-taskcluster.
We won't try to support this with pulse_actions/mozci.

A lot of the pieces necessary for this specific request have been accomplished over multiple PRs and bugs; however, re-triggering and cancelling won't be dealt with by pulse_actions/mozci.

Thanks for all the help around this project.
Status: NEW → RESOLVED
Closed: 8 years ago
Resolution: --- → INVALID