Closed Bug 1289823 Opened 8 years ago Closed 7 years ago

Add "backfill" support for TaskCluster jobs

Categories

(Firefox Build System :: Task Configuration, task)

task
Not set
normal

Tracking

(Not tracked)

RESOLVED FIXED

People

(Reporter: armenzg, Assigned: bstack)

References

Details

Attachments

(2 files)

We're soon adding SETA for TaskCluster and when we find an orange for test jobs we someitmes would like to request backfilling of that job across pushes.

For Buildbot, mozci can determine which pushes lack the job as well as if the builds for those tests have completed.

With TC is easier since I can know what build a test job depends on.

Anyways, once SETA is enabled we can look into adding this feature.
Blocks: 1080265
Will this be done in pulse_actions?
Yes
Summary: Add backfilling support for TaskCluster jobs → pulse_actions - Add "backfill" support for TaskCluster jobs
Assignee: nobody → bstack
Removing pulse_actions from the bug summary since this might be implemented under taskcluster-treeherder.
Summary: pulse_actions - Add "backfill" support for TaskCluster jobs → Add "backfill" support for TaskCluster jobs
Comment on attachment 8815086 [details]
Bug 1289823 - Make taskcluster action-task more flexible

https://reviewboard.mozilla.org/r/96072/#review96476
Attachment #8815086 - Flags: review?(armenzg) → review+
Comment on attachment 8815086 [details]
Bug 1289823 - Make taskcluster action-task more flexible

https://reviewboard.mozilla.org/r/96072/#review96480
Attachment #8815086 - Flags: review?(dustin) → review+
Attachment #8817283 - Flags: review?(dustin)
Attachment #8817283 - Flags: review?(armenzg)
Comment on attachment 8817283 [details]
Bug 1289823 - Add backfilling as an action-task

https://reviewboard.mozilla.org/r/97634/#review98116

This is on a good path.

Could you please address the comments and re-flag for review?

::: taskcluster/mach_commands.py:184
(Diff revision 3)
> +                     required=True,
> +                     help="Decision Task ID of the reference decision task")
> +    @CommandArgument('--task-labels',
> +                     required=True,
> +                     help='Comma separated list of task labels to be scheduled')
> +    def taskgraph_add_tasks(self, **options):

The subcommand is called 'action-task' while the function is called '_add_tasks'.

The next command is 'add-tasks' while the function is '_action'.

Is this intentional?

::: taskcluster/taskgraph/action.py:22
(Diff revision 3)
>  from .taskgraph import TaskGraph
>  
>  logger = logging.getLogger(__name__)
>  TASKCLUSTER_QUEUE_URL = "https://queue.taskcluster.net/v1/task/"
> +TREEHERDER_URL = "https://treeherder.mozilla.org/api/"
> +MAX_BACKFILL_RESULTSETS = 5

Let's add a note that currently SETA sets the counter for every repository to the same value.

If this was to change we would need an API on Treeherder to let you know the correct value for each repository.

::: taskcluster/taskgraph/action.py:82
(Diff revision 3)
> +     a successful job is found or `N` jobs have been scheduled.
> +    """
> +    s = requests.Session()
> +    s.headers.update({"User-Agent": "gecko-intree-backfill-task"})
> +
> +    job_url = TREEHERDER_URL + "project/" + project + "/jobs/" + job_id + "/"

I believe this is easier to read:
```py
job_url = "{}/path/{}".format(foo, bar)
```

::: taskcluster/taskgraph/action.py:86
(Diff revision 3)
> +
> +    job_url = TREEHERDER_URL + "project/" + project + "/jobs/" + job_id + "/"
> +    job = s.get(url=job_url).json()
> +
> +    if job["build_system_type"] != "taskcluster":
> +        raise ValueError("Invalid build system type! Must be a Taskcluster job.")

Can we get to this situation?
How can someone request a backfill of a non-taskcluster job and this backfill task be triggered?

I thought the pulse messages from Treeherder would only notify you of TaskCluster requests.

::: taskcluster/taskgraph/action.py:90
(Diff revision 3)
> +    if job["build_system_type"] != "taskcluster":
> +        raise ValueError("Invalid build system type! Must be a Taskcluster job.")
> +
> +    platform = job["build_platform_id"]
> +    platform_option = job["platform_option"]
> +    job_type = job["job_type_id"]

You can probably turn this into:
```py
job_to_backfill = {
    'build_platform_id': ...
    ...
}
```

This will make it so you can call load_decisions() with just passing `job_to_backfill` and make the comparison of is_job_type() be a comparison between two dictionaries.

::: taskcluster/taskgraph/action.py:93
(Diff revision 3)
> +    platform = job["build_platform_id"]
> +    platform_option = job["platform_option"]
> +    job_type = job["job_type_id"]
> +    task_label = job["job_type_name"]
> +
> +    resultset_url = TREEHERDER_URL + "project/" + project + "/resultset/"

Same as above.

::: taskcluster/taskgraph/action.py:103
(Diff revision 3)
> +    decisions = load_decisions(s, project, platform, platform_option, resultsets, job_type)
> +
> +    [add_tasks(decision, [task_label], '{}-'.format(decision)) for decision in decisions]
> +
> +
> +def load_decisions(s, project, platform, platform_option, resultsets, job_type):

Could you please add a docstring describing what we're doing in here?

::: taskcluster/taskgraph/action.py:104
(Diff revision 3)
> +
> +    [add_tasks(decision, [task_label], '{}-'.format(decision)) for decision in decisions]
> +
> +
> +def load_decisions(s, project, platform, platform_option, resultsets, job_type):
> +    url = TREEHERDER_URL + "project/" + project + "/jobs/"

Please get rid of this variable since you don't need it for both loops.

::: taskcluster/taskgraph/action.py:107
(Diff revision 3)
> +
> +def load_decisions(s, project, platform, platform_option, resultsets, job_type):
> +    url = TREEHERDER_URL + "project/" + project + "/jobs/"
> +    decisions = []
> +    for resultset in resultsets:
> +        params = {"push_id": resultset, "count": 2000}

I think that we're getting to the place where we can't use the magical 2000 count.
IIRC it returns 2000 jobs for a push.
We will soon be hitting that limit if we don't already.

Could you please file a bug for TH team to have a way to simply say "give me all jobs for a push" rather than "give me X jobs for a push".

Not blocking on landing. Double the number for now.
If you've seen that number anywhere else please let me know! :D

::: taskcluster/taskgraph/action.py:111
(Diff revision 3)
> +    for resultset in resultsets:
> +        params = {"push_id": resultset, "count": 2000}
> +        results = s.get(url=url, params=params).json()["results"]
> +        filtered = [j for j in results if is_job_type(j, platform, job_type, platform_option)]
> +        if len(filtered) > 1:
> +            raise Exception("Too many jobs matched. Aborting.")

I believe this raises an exception if too many jobs are found for the job we want to backfill on one of the many pushes we're considering.

In such case, will we schedule anything?

::: taskcluster/taskgraph/action.py:124
(Diff revision 3)
> +        details = s.get(url=TREEHERDER_URL + "jobdetail/", params=params).json()["results"]
> +        inspect = [detail["url"] for detail in details if detail["value"] == "Inspect Task"][0]
> +
> +        # Pull out the taskId from the URL e.g.
> +        # oN1NErz_Rf2DZJ1hi7YVfA from tools.taskcluster.net/task-inspector/#oN1NErz_Rf2DZJ1hi7YVfA/
> +        decision_ids.append(inspect.partition('#')[-1].rpartition('/')[0])

Could you please start a thread for the TH team?
I would like the jobdetail (or another API) to have the task ID of each TaskCluster job.
You should not have to do this.

Not a blocker issue.
Attachment #8817283 - Flags: review?(armenzg) → review-
Comment on attachment 8817283 [details]
Bug 1289823 - Add backfilling as an action-task

https://reviewboard.mozilla.org/r/97634/#review98152

::: taskcluster/taskgraph/action.py:86
(Diff revision 3)
> +
> +    job_url = TREEHERDER_URL + "project/" + project + "/jobs/" + job_id + "/"
> +    job = s.get(url=job_url).json()
> +
> +    if job["build_system_type"] != "taskcluster":
> +        raise ValueError("Invalid build system type! Must be a Taskcluster job.")

This might be better as a short-circuit return -- no need to mark this action task as a failure in this csae..

::: taskcluster/taskgraph/action.py:100
(Diff revision 3)
> +    results = s.get(url=resultset_url, params=params).json()["results"]
> +    resultsets = [resultset["id"] for resultset in results]
> +
> +    decisions = load_decisions(s, project, platform, platform_option, resultsets, job_type)
> +
> +    [add_tasks(decision, [task_label], '{}-'.format(decision)) for decision in decisions]

I think the gecko style is to prefer explicit iteration (`for ..: ..`) when the side effect is more important than the resulting list.
Attachment #8817283 - Flags: review?(dustin)
Comment on attachment 8817283 [details]
Bug 1289823 - Add backfilling as an action-task

https://reviewboard.mozilla.org/r/97634/#review98116

> Let's add a note that currently SETA sets the counter for every repository to the same value.
> 
> If this was to change we would need an API on Treeherder to let you know the correct value for each repository.

This is the maximum to backfill, right?  In which case this value can be fixed, but must be at least as large as the maximum SETA counter.

> I think that we're getting to the place where we can't use the magical 2000 count.
> IIRC it returns 2000 jobs for a push.
> We will soon be hitting that limit if we don't already.
> 
> Could you please file a bug for TH team to have a way to simply say "give me all jobs for a push" rather than "give me X jobs for a push".
> 
> Not blocking on landing. Double the number for now.
> If you've seen that number anywhere else please let me know! :D

If there's not an established way to use a continuation token of some sort to iterate over these results, we should probably raise that as an issue for the TH folks.  2000 is a lot for one HTTP request.  4000 is a LOT.  Even more will be very problematic..
Comment on attachment 8817283 [details]
Bug 1289823 - Add backfilling as an action-task

https://reviewboard.mozilla.org/r/97634/#review98416

::: taskcluster/taskgraph/action.py:86
(Diff revision 3)
> +
> +    job_url = TREEHERDER_URL + "project/" + project + "/jobs/" + job_id + "/"
> +    job = s.get(url=job_url).json()
> +
> +    if job["build_system_type"] != "taskcluster":
> +        raise ValueError("Invalid build system type! Must be a Taskcluster job.")

This is partially here just as a sanity check in case it gets triggered for a non-tc job. I can remove it if we think it is unnecessary, but I can see situations where I might mess up the service somehow and this would get triggered for some other sort of job.

I can see how this is kinda like a ```a = 2 + 2; assert(a == 4)```, but hopefully not that bad.

::: taskcluster/taskgraph/action.py:104
(Diff revision 3)
> +
> +    [add_tasks(decision, [task_label], '{}-'.format(decision)) for decision in decisions]
> +
> +
> +def load_decisions(s, project, platform, platform_option, resultsets, job_type):
> +    url = TREEHERDER_URL + "project/" + project + "/jobs/"

I've fixed this by naming it more nicely and pulling the other url up to the top as well. It flows a bit more nicely this way I think.

::: taskcluster/taskgraph/action.py:107
(Diff revision 3)
> +
> +def load_decisions(s, project, platform, platform_option, resultsets, job_type):
> +    url = TREEHERDER_URL + "project/" + project + "/jobs/"
> +    decisions = []
> +    for resultset in resultsets:
> +        params = {"push_id": resultset, "count": 2000}

I just saw that number in Treeherder UI itself and cargo-culted it here. The MAX_JOBS_COUNT is 2000 afaict, but it does appear to have "offset" available, so we can iterate through.

::: taskcluster/taskgraph/action.py:111
(Diff revision 3)
> +    for resultset in resultsets:
> +        params = {"push_id": resultset, "count": 2000}
> +        results = s.get(url=url, params=params).json()["results"]
> +        filtered = [j for j in results if is_job_type(j, platform, job_type, platform_option)]
> +        if len(filtered) > 1:
> +            raise Exception("Too many jobs matched. Aborting.")

I had this in here to assert that my understanding of the system was accurate. If I had too few filtered attributes, this check would fail. If we know that the amount of attributes we're filtering on is correct, then we can remove this.

::: taskcluster/taskgraph/action.py:124
(Diff revision 3)
> +        details = s.get(url=TREEHERDER_URL + "jobdetail/", params=params).json()["results"]
> +        inspect = [detail["url"] for detail in details if detail["value"] == "Inspect Task"][0]
> +
> +        # Pull out the taskId from the URL e.g.
> +        # oN1NErz_Rf2DZJ1hi7YVfA from tools.taskcluster.net/task-inspector/#oN1NErz_Rf2DZJ1hi7YVfA/
> +        decision_ids.append(inspect.partition('#')[-1].rpartition('/')[0])

Created https://bugzilla.mozilla.org/show_bug.cgi?id=1323110
Comment on attachment 8817283 [details]
Bug 1289823 - Add backfilling as an action-task

https://reviewboard.mozilla.org/r/97634/#review98574

LGTM
Attachment #8817283 - Flags: review?(armenzg) → review+
Comment on attachment 8817283 [details]
Bug 1289823 - Add backfilling as an action-task

https://reviewboard.mozilla.org/r/97634/#review98416

> I had this in here to assert that my understanding of the system was accurate. If I had too few filtered attributes, this check would fail. If we know that the amount of attributes we're filtering on is correct, then we can remove this.

This is OK. I don't know how to clear the issue on mozreview. Ignore that is still active.
Comment on attachment 8817283 [details]
Bug 1289823 - Add backfilling as an action-task

https://reviewboard.mozilla.org/r/97634/#review98592
Attachment #8817283 - Flags: review?(dustin) → review+
Status: NEW → ASSIGNED
Keywords: checkin-needed
Keywords: checkin-needed
Keywords: checkin-needed
Pushed by ryanvm@gmail.com:
https://hg.mozilla.org/integration/autoland/rev/1bdd821a52e2
Make taskcluster action-task more flexible r=armenzg,dustin
https://hg.mozilla.org/integration/autoland/rev/5d8a7572ef59
Add backfilling as an action-task r=armenzg,dustin
Keywords: checkin-needed
https://hg.mozilla.org/mozilla-central/rev/1bdd821a52e2
https://hg.mozilla.org/mozilla-central/rev/5d8a7572ef59
Status: ASSIGNED → RESOLVED
Closed: 7 years ago
Resolution: --- → FIXED
I see a pulse_actions PR so I assume we're not yet done.
Status: RESOLVED → REOPENED
Resolution: FIXED → ---
I've merged and deployed the latest pulse_actions' PR:
https://github.com/mozilla/pulse_actions/pull/121
Here's the current state of this:

* Treeherder uses tc login, but doesn't store the credentials needed to access tc services directly with a logged-in user's client. This means that until this is fixed (bug 1273096), we need to continue to use pulse_actions to trigger these tasks. Even once that bug is fixed, a decision must be made on bug 1278986.
* pulse_actions currently only has tc scopes to add jobs to try pushes. Pending bug 1284989, we won't be able to enable this for backfilling for any of the repos where backfilling is actually useful. The scopes in question are particularly ["queue:route:tc-treeherder.v2.autoland.*", "queue:route:tc-treeherder-stage.v2.autoland.*"].
* I temporarily gave pulse_actions those scopes to see what would happen. https://tools.taskcluster.net/task-inspector/#USPq6YayRhW0JgnpacFBrw/0 is the result. As you can see, it failed because the action task itself assumes it will only have access to level-1 caches. I assume some discussion about this will also happen in a risk assesment.
Status: REOPENED → ASSIGNED
Depends on: 1324094
Backfilling should be working now. Tested and shown to work on this autoland range: https://treeherder.mozilla.org/#/jobs?repo=autoland&filter-searchStr=linux%2064%20pgo%20tc%20mochitest&fromchange=487f977ef67d5ab211dd3fa5ca565948ededef4e&tochange=ff9a07a874fd1ef273f23aa26cac61a95ba0f342

bc5 tests were backfilled from ff9a07a874fd and added to the previous 5 jobs. Marking this resolved for now. Please reopen if this doesn't work!
Status: ASSIGNED → RESOLVED
Closed: 7 years ago7 years ago
Resolution: --- → FIXED
Component: Treeherder → Task Configuration
Product: Tree Management → Taskcluster
Version: --- → unspecified
Blocks: 1293783
Product: TaskCluster → Firefox Build System
You need to log in before you can comment on or make changes to this bug.

Attachment

General

Created:
Updated:
Size: