Closed Bug 1245943 Opened 8 years ago Closed 8 years ago

Re-design pulse_actions to show jobs on Treeherder to give developers an insight of how their scheduling requests went

Categories

(Testing :: General, defect)

Type: defect
Priority: Not set
Severity: normal

Tracking

(Not tracked)

RESOLVED FIXED

People

(Reporter: armenzg, Assigned: armenzg)


Some of the requests can take several minutes to process (e.g. backfilling).

There are various ideas on how to make this work better.
For instance, move away from buildjson files and use Treeherder as our source of truth (which could be inaccurate at times).

The other option is to parallelize.

I came up with an idea which kills two birds with one stone.
I want pulse_actions to schedule a task on Treeherder (add visibility/add public logging) and that task will take its time to determine exactly what to schedule for the user.

The task would have to have the right scopes to schedule jobs (similar to what the gecko decision task can do).

garndt: this is a controlled environment where pulse_actions only listens to requests from logged in developers. pulse_actions also makes sure that we can trust the users making the requests before scheduling anything.

I *might* need to schedule some jobs directly through Buildapi which is probably concerning since I need a set of Buildapi credentials. I might be able to get rid of this requirement but it is unclear at this moment.

I would prefer if there was a way to prevent that job from being re-triggered (I don't want anyone trying to execute scheduling of jobs by multiple tasks).

This is not something I can work on now but I would like to in the next two months after TC work.
garndt: I would like to schedule a task that can schedule other tasks; what is the right way of creating temporary credentials for such a task with the right scopes?

Atm, I would only need BBB scopes [1]

[1] https://tools.taskcluster.net/auth/roles/#client-id:bbb-scheduler
Flags: needinfo?(garndt)
Summary: Pulse_actions scheduling requests are falling behind + there is no feedback for the developer → pulse_actions scheduling requests are falling behind + there is no feedback for the developer
I think that as long as your task has the scope "scheduler:extend-task-graph:<graphId>", you should be able to write the tasks to a JSON blob in some file; then, in the task payload, you would have "graphs: ['<some file path>']" and the worker will automatically extend the graph.

I think that might work out for you.
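
As a rough, untested sketch, the relevant parts of such a task definition might look like the Python dict below. Only the scope string and the "graphs" payload key come from the comment above; the helper name, the graph ID placeholder and the file path are hypothetical, and the rest of the task definition is omitted.

```python
def build_extending_task(graph_id, graph_file):
    """Return the parts of a task definition that ask the worker to extend a graph.

    Assumption: only the scope string and the 'graphs' payload key are taken
    from the comment above. A real task needs many more fields.
    """
    return {
        # Scope that allows the task to add tasks to the given graph.
        "scopes": ["scheduler:extend-task-graph:" + graph_id],
        "payload": {
            # File(s) written by the task that hold the JSON blob of extra tasks.
            "graphs": [graph_file],
        },
    }


# Placeholder values for illustration only.
task = build_extending_task("<graphId>", "/path/to/extra_tasks.json")
```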
Flags: needinfo?(garndt)
Oh wait, I misunderstood, this would be with the BBB.
I haven't tried it, but your task might be able to get the scope I suggested in comment 2 and then use the taskclusterProxy feature to extend the task graph.
(In reply to Greg Arndt [:garndt] from comment #4)
> I haven't tried it, but your task might be able to get the scope I suggested
> in comment 2 and then use the taskclusterProxy feature to extend the task
> graph.

Thanks garndt; is there any documentation for the taskclusterProxy, or a point of contact?
Summary: pulse_actions scheduling requests are falling behind + there is no feedback for the developer → Re-design pulse_actions to schedule task through TaskCluster in order to give feedback to devs + parallelization
For the record, I want such a task to show on Treeherder.
This will require adding the required scopes and variables on the task for it to show up on Treeherder.
Note to self: If I use a TC task to do the scheduling instead of doing it inside of pulse_actions, I have to make sure that such a task cannot be retried or new scheduling will happen.
Assignee: nobody → armenzg
Status: NEW → ASSIGNED
I was trying to do this without using TaskCluster by simply using Treeherder (I was hoping it would be easier).
Unfortunately, I forgot that Heroku does not keep logs on disk but deletes them, which makes it pointless to have a TH job point to them since they will be gone in a day.
I will have to create a TaskCluster task to store the logs and show the job on TH.
jonasfj pointed me to an API that allows me to upload files to an S3 bucket for 31 days.
I also have my treeherder_submitter package ready for prime time.
I should have something landed by Monday.
I've deployed this to production.

Related changes to this project:
https://github.com/mozilla/pulse_actions/commit/bba2ce71c191ea936b753c3c8b3821275724ae4d - Major refactoring
https://github.com/mozilla/pulse_actions/commit/9b0da8bf7867d567990e5a75156f0545fe0f4436 - Another refactoring
https://github.com/mozilla/pulse_actions/commit/f108989ceda750ad3bf020e314c303be884deb50 - Create log file per request
https://github.com/mozilla/pulse_actions/commit/6e8a6e3717e9ebf4d96986a557839b8a2019962e - Submit Sch job but disabled
https://github.com/mozilla/pulse_actions/commit/32af581b86093e7ecf1c9f7b3c69c753b5407328 - Upload logs to S3 and add to Sch
https://github.com/mozilla/pulse_actions/commit/b2237739f7773de495b7405c623c71d74fe29b8d - Enable for production
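
The "create log file per request" step can be sketched with the standard logging module as below. This is a minimal illustration and not the code from the commit above; the function names, the logger naming scheme and the request ID are all made up.

```python
import logging
import os
import tempfile
import uuid


def start_request_log(log_dir, request_id=None):
    """Attach a dedicated file handler so one request gets one log file."""
    request_id = request_id or str(uuid.uuid4())
    path = os.path.join(log_dir, request_id + ".log")
    handler = logging.FileHandler(path)
    # Plain messages only, so the uploaded log is easy for developers to read.
    handler.setFormatter(logging.Formatter("%(message)s"))
    logger = logging.getLogger("request." + request_id)
    logger.setLevel(logging.INFO)
    logger.propagate = False  # keep these lines out of the global log
    logger.addHandler(handler)
    return logger, handler, path


def end_request_log(logger, handler):
    """Detach and close the handler so the file is flushed and can be uploaded."""
    logger.removeHandler(handler)
    handler.close()


log_dir = tempfile.mkdtemp()
logger, handler, path = start_request_log(log_dir, "example-request")
logger.info("Processing backfill request")
end_request_log(logger, handler)
```

Once the handler is closed, the file at `path` is complete and can be handed to whatever uploads it.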

Unfortunately, I landed the changes directly on master without squashing from a branch, thus, creating many small commits that still applied towards the development of this bug.

pulse_actions is now a much easier system to test and develop.

In the process of working on this I've also created the following packages:
https://github.com/armenzg/treeherder_submitter
https://github.com/armenzg/taskcluster_s3_uploader
https://github.com/armenzg/pulse_replay - It helps process Pulse messages from a local dump

Here are some scripts that allow anyone to test the functionality of each new package:
https://github.com/armenzg/TC_developer_scheduling_experiments/blob/master/treeherder_submitter_script.py
https://github.com/armenzg/TC_developer_scheduling_experiments/blob/master/store_on_s3.py
Summary: Re-design pulse_actions to schedule task through TaskCluster in order to give feedback to devs + parallelization → Re-design pulse_actions to show jobs on Treeherder to give developers an insight of how their scheduling requests went
I had to disable this temporarily as I think I've found a bug where we post more 'Sch' jobs than needed.
sheriffs; tl;dr: when you request to add new jobs or backfill, you will see a 'Sch' job which will have information about how the request was processed. I will mention it on dev.platform.

I fixed the issue of when to actually show a Treeherder job, rather than showing one for all Pulse messages.

I also fixed a Treeherder issue where jobs with a machine name longer than 50 characters do not show up.

Here's my first production submission:
https://treeherder.mozilla.org/#/jobs?repo=mozilla-aurora&revision=3053d997f2e9f0fc7dd3960927c92496df1d9ace&filter-searchStr=sch

I tested:
* backfilling
** https://tc-gp-public-31d.s3-us-west-2.amazonaws.com/ateam/pulse-action-dev/d17a6153-17df-4f75-892a-079ff51c562e
* add new jobs
** https://tc-gp-public-31d.s3-us-west-2.amazonaws.com/ateam/pulse-action-dev/241920a6-98bb-457f-83f6-a379711f1d20
* trigger all talos jobs
** https://tc-gp-public-31d.s3-us-west-2.amazonaws.com/ateam/pulse-action-dev/da126ef2-586b-4fe9-aaa3-99ef905b04e3

Here's the code that fixed the remaining issues:
* 50 chars issue
** https://github.com/armenzg/treeherder_submitter/commit/2611496c164da56a918423f5d53c7be623c8cf02
* Pulse messages that are not processed should not show up on Treeherder
** https://github.com/mozilla/pulse_actions/commit/5ca39b9b29aa22a2bd909dfdd998b44b67dd9de1
* Adding new jobs now uses 'requested_jobs' instead of 'buildernames':
** https://github.com/mozilla/pulse_actions/commit/b3abe32d3720122d9f0e53fa4607344bb67c3d12
* Talos handler should not schedule TH jobs
** https://github.com/mozilla/pulse_actions/commit/c975257e62ffc2e7b17fd0a3884b9b47872bb103
* Handlers do not guarantee exit codes; hack around it
** https://github.com/mozilla/pulse_actions/commit/6e23988c2edba57c7c454b1aa590ee6e4523d2ca
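
For the 50 chars issue, one simple way to stay within the limit is to truncate the machine name before submitting the job. This sketch assumes truncation is acceptable; the linked commit may handle it differently, and the names below are illustrative.

```python
MAX_MACHINE_NAME_LENGTH = 50  # limit mentioned above


def safe_machine_name(name, limit=MAX_MACHINE_NAME_LENGTH):
    """Return the machine name cut down to at most `limit` characters."""
    return name[:limit]


long_name = "a-very-long-hostname." * 5  # 105 characters
short_name = safe_machine_name(long_name)
```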
Depends on: 1283555
Earlier today we cycled the secret for the Treeherder credentials.
What worked, suddenly stopped working (bug 1283555).
From backend errors we believed it was a clock issue (hawk requires client and server to be within 60 seconds of each other).
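
For reference, the kind of clock check we suspected can be sketched as follows. The 60-second window comes from the note above; the function name is made up for illustration and is not Hawk's actual API.

```python
MAX_SKEW_SECONDS = 60  # window mentioned above


def within_allowed_skew(client_ts, server_ts, max_skew=MAX_SKEW_SECONDS):
    """Return True if the two Unix timestamps are close enough to each other."""
    return abs(client_ts - server_ts) <= max_skew


ok = within_allowed_skew(1000, 1050)       # 50 seconds apart
too_far = within_allowed_skew(1000, 1100)  # 100 seconds apart
```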

After much reading, debugging and filing a Heroku ticket, I noticed the env value was a bit shifted when I ran "heroku config".
I then set it again and voila! It all started working again.
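
A small guard like the one below could catch a malformed value at startup. It assumes that "shifted" means stray leading or trailing whitespace, which may not be exactly what happened here; the variable name TREEHERDER_SECRET is hypothetical.

```python
import os


def check_secret_env(name, environ=None):
    """Return the value of `name`, rejecting missing or whitespace-padded values."""
    environ = os.environ if environ is None else environ
    value = environ.get(name)
    if value is None:
        raise ValueError("%s is not set" % name)
    if value != value.strip():
        raise ValueError("%s has leading or trailing whitespace" % name)
    return value


# Example with an explicit mapping instead of the real environment.
value = check_secret_env("TREEHERDER_SECRET", {"TREEHERDER_SECRET": "abc123"})
```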

Almost 6 hours later and I can move on.
No longer depends on: 1283555
All known bugs have been resolved!

* If we can't submit a Treeherder job do not fail to process the request
** https://github.com/mozilla/pulse_actions/commit/2896d9b3b5b6a933074da803cb5a9fc3692d0cac
* In previous refactoring we had removed logging level names which are important for Papertrail alerting:
** https://github.com/mozilla/pulse_actions/commit/16fe33a9d09b72086e37fb3afdd08c9d3f6d35d8
* Improved logs for developers by removing levelnames and asctimes
** https://github.com/mozilla/pulse_actions/commit/cccf20fbcd5f637a3e67a137133160c227d561d9
* Fix: request S3 credentials before we upload a file instead of once at the beginning of the process
** https://github.com/mozilla/pulse_actions/commit/b9a897bd053b8ae4c4f4bf674f7d6fb0117eb41d
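
The first item in the list above (a failed Treeherder submission should not fail the request) boils down to the pattern below. This is a sketch of the idea, not the code from the commit; `handle` and `submit_to_treeherder` are hypothetical callables.

```python
import logging

LOG = logging.getLogger(__name__)


def process_request(handle, submit_to_treeherder):
    """Process the request, then report to Treeherder on a best-effort basis."""
    exit_code = handle()
    try:
        submit_to_treeherder(exit_code)
    except Exception:
        # Reporting is secondary: log the failure and keep the request's result.
        LOG.exception("Could not submit the Treeherder job; continuing.")
    return exit_code


def failing_submit(exit_code):
    raise RuntimeError("Treeherder is unreachable")


result = process_request(lambda: 0, failing_submit)
```

The request's own exit code is returned whether or not the submission succeeded.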
Status: ASSIGNED → RESOLVED
Closed: 8 years ago
Resolution: --- → FIXED