Closed Bug 1612547 Opened 4 years ago Closed 4 years ago

Define component for doing backfills using existing reports

Categories

(Tree Management :: Perfherder, task, P1)

Tracking

(Not tracked)

RESOLVED FIXED

People

(Reporter: igoldan, Assigned: igoldan)

References

Details

Attachments

(1 file, 1 obsolete file)

We already collect the data we need to perform these backfills. You can find it in the backfill_report & backfill_record tables, on Treeherder.

Define a basic component that's capable of using this data to issue real backfills on Taskcluster.

This code must be added as dead code; that is, don't hook it to any Django view or management script. We'll tackle that in subsequent tasks.

Breakdown (still subject to change):

  • a basic, stateless backfill mechanic is enough
  • should leverage the strategy pattern, as we're also planning to implement custom retriggers alongside it
  • test coverage
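
A minimal sketch of what the strategy pattern could look like here, assuming nothing about the eventual Treeherder code: JobAction, Backfill, Retrigger and process are hypothetical names, and the injected trigger callable stands in for whatever actually talks to Taskcluster.

```python
from abc import ABC, abstractmethod


class JobAction(ABC):
    """Strategy interface (hypothetical): one way of re-running work for a job."""

    @abstractmethod
    def run(self, job_id: int) -> str:
        """Perform the action and return the id of the task that was created."""


class Backfill(JobAction):
    def __init__(self, trigger):
        # `trigger` is an injected callable standing in for the real Taskcluster call.
        self._trigger = trigger

    def run(self, job_id: int) -> str:
        return self._trigger("backfill", job_id)


class Retrigger(JobAction):
    def __init__(self, trigger):
        self._trigger = trigger

    def run(self, job_id: int) -> str:
        return self._trigger("retrigger", job_id)


def process(job_ids, action: JobAction):
    """Stateless driver: applies whichever strategy it is handed to each job."""
    return [action.run(job_id) for job_id in job_ids]
```

Because the driver keeps no state and only sees the JobAction interface, adding the custom retrigger later would not require touching it.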
Depends on: 1612542
Assignee: nobody → onegru
See Also: → 1613925
Depends on: 1616263
Assignee: onegru → igoldan
Priority: P2 → P1

My approach for this ticket is reverse engineering based on Treeherder's frontend implementation of the Backfill button, from Jobs view.

After clicking that button, I noticed (via the browser's developer console > Network tab) an exchange of 6 to 8 HTTP calls between Treeherder & Taskcluster. The APIs I recognized there are Queue, Auth & Hooks.

From this exchange, the way I picture the protocol is as follows:

  • request task definition of selected job (includes task group id)
  • request the task's actions.json (not yet sure about its utility though...)
  • request task definition of that job's task group (includes the scopes)
  • request the scopes' expansion
  • trigger the backfill of that job (assuming the expanded scopes are compatible with the actions.json)
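
The five steps above could be sketched roughly as follows. This is only an illustration of the sequence as observed in the Network tab: the client object and its get_task, get_actions, expand_scopes and trigger methods are hypothetical adapters, not the real Taskcluster client API, and required_scope is an assumed input. The sketch also assumes the task group id doubles as the task id of the group's defining task. The satisfies helper applies the Taskcluster rule that a scope ending in * covers every scope sharing that prefix.

```python
def satisfies(held_scopes, required_scope):
    """True if a held scope equals the required one or covers it via a trailing '*'."""
    for scope in held_scopes:
        if scope == required_scope:
            return True
        if scope.endswith("*") and required_scope.startswith(scope[:-1]):
            return True
    return False


def backfill(job_task_id, client, required_scope):
    """Walk the five observed steps; `client` is a hypothetical adapter object."""
    # 1. Task definition of the selected job (carries the task group id).
    task = client.get_task(job_task_id)
    group_id = task["taskGroupId"]

    # 2. The actions.json published for that task group.
    actions = client.get_actions(group_id)

    # 3. Task definition of the task group's own task (carries the scopes).
    group_task = client.get_task(group_id)

    # 4. Expansion of those scopes.
    expanded = client.expand_scopes(group_task["scopes"])

    # 5. Trigger the backfill only if the expanded scopes allow it.
    if not satisfies(expanded, required_scope):
        raise PermissionError(f"missing scope: {required_scope}")
    return client.trigger(actions, "backfill", job_task_id)
```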

Cameron, I noticed you have the most experience in interacting with Taskcluster.
Are my findings correct?
Have I missed something important?

During this research, I also noticed that we're requesting the same actions.json file from 2 different endpoints:

Is it for backwards compatibility? Could we remove one of these duplicated requests (the 2nd one)?

Flags: needinfo?(cdawson)
Attachment #9125392 - Attachment is obsolete: true
Priority: P1 → P2
Blocks: 1618832
Priority: P2 → P1

(In reply to Ionuț Goldan [:igoldan] from comment #3)

My approach for this ticket is reverse engineering based on Treeherder's frontend implementation of the Backfill button, from Jobs view.

After clicking that button, I noticed (via the browser's developer console > Network tab) an exchange of 6 to 8 HTTP calls between Treeherder & Taskcluster. The APIs I recognized there are Queue, Auth & Hooks.

From this exchange, the way I picture the protocol is as follows:

  • request task definition of selected job (includes task group id)
  • request the task's actions.json (not yet sure about its utility though...)
  • request task definition of that job's task group (includes the scopes)
  • request the scopes' expansion
  • trigger the backfill of that job (assuming the expanded scopes are compatible with the actions.json)

Cameron, I noticed you have the most experience in interacting with Taskcluster.
Are my findings correct?
Have I missed something important?

During this research, I also noticed that we're requesting the same actions.json file from 2 different endpoints:

Is it for backwards compatibility? Could we remove one of these duplicated requests (the 2nd one)?

This sounds about right to me. There have been many hands in this set of interactions, so I wonder if an "extra" step was added in getting the actions.json. And I wouldn't be surprised if we needed a bit of refactoring here.

The utility of the actions.json is to let you know which actions you are able to perform on that task.

The code for this set of actions is in ui/models/job.js, in the retrigger function.
https://github.com/mozilla/treeherder/blob/fbdb71d41c18636fa8b31b869217c3526b26d0d0/ui/models/job.js#L90

As far as Treeherder is concerned, we just do 3 steps:

  1. Find the Decision Task
  2. Use that to get the actions.json via the load command
  3. Get the retrigger action out of the actions.json and submit a retrigger request.
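
As a rough illustration of step 3, here is a sketch of picking an action out of an already loaded actions.json. It assumes the layout described in the Taskcluster actions spec, where each action has a name and a context list of tag-sets and applies to a task when the task's tags include every key/value of at least one tag-set. find_action and matches are hypothetical helper names, not the functions in ui/models/taskcluster.js.

```python
def matches(task_tags, tag_set):
    """True if the task carries every key/value pair in the tag-set."""
    return all(task_tags.get(key) == value for key, value in tag_set.items())


def find_action(actions_json, name, task_tags):
    """Return the named action if its context applies to a task with these tags."""
    for action in actions_json.get("actions", []):
        if action.get("name") != name:
            continue
        # An empty context list matches no task; a list holding an empty
        # tag-set ([{}]) matches every task.
        if any(matches(task_tags, tag_set) for tag_set in action.get("context", [])):
            return action
    raise LookupError(f"no applicable action named {name!r}")
```

Submitting the request would then depend on the action's kind, which this sketch leaves out.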

I notice in https://github.com/mozilla/treeherder/blob/5d1a34285b1aadbfa39de4fecfac81656482c195/ui/models/taskcluster.js#L18
that there is a taskcluster.getQueue call in both the load and submit functions. I must admit that I didn't write these functions, though my name is in the git blame for them: I migrated them over during the React rewrite, so my domain knowledge of them is minimal. I think that Tom Prince may have written these functions initially. Perhaps he could help us refactor this code to make it better. There have been LOTS of updates to the taskcluster package we import since then. :)

Flags: needinfo?(cdawson)

Thanks for the feedback, Cameron!

Tom, could you look over the PR I've submitted for this bug?

Flags: needinfo?(mozilla)

https://docs.taskcluster.net/docs/manual/design/conventions/actions has some discussion about actions, including a link to the spec, and some discussion about implementation.

Release engineering's shipit also has some code here for triggering actions. That code looks quite a bit simpler than what you have, but we also have a very narrow use case, so we can make a number of simplifications for that particular case.

I had a brief look at the PR, but it is a large amount of fairly complex code, and it is hard to evaluate without some more context. It isn't clear if there is any code that consumes the new interfaces, so I'm not sure how to evaluate it. I could make a bunch of comments about specific details, but I think it would be more useful to look at the high-level design before commenting on individual lines of code.

Flags: needinfo?(mozilla)

(In reply to Tom Prince [:tomprince] from comment #6)

[...]
I had a brief look at the PR, but it is a large amount of fairly complex code, and it is hard to evaluate without some more context.

In this bug we're especially interested in translating this Javascript implementation for backfilling a job directly into Python code. Basically, the main interest in reviewing the PR is "How close is the Python variant of TaskclusterModel + backfill mechanic to its older Javascript version?".
This particular approach has been brought up to Joel Maher & Cameron Dawson, who already approved it.

I attempted a basic one-to-one translation, the main difference being that the Python code is synchronous rather than async like the Javascript one (I'm not that familiar with Python's async support, but would gladly add it in the future if need be).
The Javascript backfill makes use of the TaskclusterModel defined here, which is also fairly complex, as it's used in multiple places on the frontend; I've counted only 8 of them so far.

I agree backfilling in Python is a way narrower use case for now than what's available on the Javascript version of TaskclusterModel. But we already have plans for integrating retriggering & creation of Gecko profiles from the backend side. Possibly others as well. So the complexity & abstraction will inevitably increase in the future and will justify themselves.

The Javascript version of TaskclusterModel is solid and has proven stable over the years. I wouldn't like us to implement the same functionality in a different way because:

  • we'd have 2 codebases doing the same thing, but with very different implementations
  • we'd likely come across new bugs that have already been squashed in the Javascript implementation

It isn't clear if there is any code that consumes the new interfaces, so I'm not sure how to evaluate it. [...]

Preparing the local environment is a bit tricky, indeed. You'd need to follow the steps from Treeherder's Direct database access, to get access to a read only replica from Treeherder production.

That'll be needed for the new Django management script from treeherder/perf/management/commands/backfill_perf_jobs.py. It's been added only in this PR.

The script expects a Treeherder id of a performance job. For example, given this Rap(godot) perf job, you can get its id from the selectedJob=293299921 query param, once you've selected the job. Once you run it with ./manage.py backfill_perf_jobs 293299921 (from a local fullstack Treeherder environment), you should be able to see the new AC(Bk) decision task & then the new corresponding perf tasks directly on production Treeherder.
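
If copying the id by hand gets tedious, a small helper along these lines could pull it out of a Jobs view URL. This is just a convenience sketch: selected_job_id is a hypothetical name, and it assumes the selectedJob param sits either in the normal query string or in a query embedded in the #/jobs?... fragment.

```python
from urllib.parse import parse_qs, urlparse


def selected_job_id(url: str) -> int:
    """Extract the selectedJob query param from a Treeherder Jobs view URL."""
    parts = urlparse(url)
    query = parts.query
    # Hash-routed URLs keep their params inside the fragment instead.
    if not query and "?" in parts.fragment:
        query = parts.fragment.split("?", 1)[1]
    values = parse_qs(query).get("selectedJob")
    if not values:
        raise ValueError("URL has no selectedJob param")
    return int(values[0])
```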

Of course, you'll first need to provide your real user's client id & access token via the PERF_SHERIFF_BOT_CLIENT_ID & PERF_SHERIFF_BOT_ACCESS_TOKEN env variables respectively. Plus, configure the credentials for your read only replica using the DATABASE_URL variable. You set those using a local .env file.
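
Before running the script, a quick pre-flight check like the one below can confirm that all three variables are actually set. The variable names come from this comment; check_env is a hypothetical helper and not part of the PR.

```python
import os

REQUIRED = (
    "PERF_SHERIFF_BOT_CLIENT_ID",
    "PERF_SHERIFF_BOT_ACCESS_TOKEN",
    "DATABASE_URL",
)


def check_env(env=os.environ):
    """Return the required settings, or fail naming every variable that is unset."""
    missing = [name for name in REQUIRED if not env.get(name)]
    if missing:
        raise RuntimeError("missing env variables: " + ", ".join(missing))
    return {name: env[name] for name in REQUIRED}
```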

This definitely is a mouthful, so we should probably sync on Zoom to make sure we're following the same steps for running the script.

Flags: needinfo?(mozilla)

Clearing ni? as we already held a meeting about these.

Flags: needinfo?(mozilla)
Status: NEW → RESOLVED
Closed: 4 years ago
Resolution: --- → FIXED