Closed Bug 1341934 Opened 4 years ago Closed 3 years ago

Only use the taskcluster-proxy when wanted

Categories

(Firefox Build System :: Task Configuration, task)

task
Not set
normal

Tracking

(Not tracked)

RESOLVED WONTFIX

People

(Reporter: glandium, Assigned: glandium)

References

Details

Attachments

(1 file)

Currently, taskgraph.task.base and taskgraph.task.docker_image choose between taskcluster-proxy and normal urls based on TASK_ID. This currently works because the jobs running that code and needing the proxy runs in taskcluster docker images that have access to the taskcluster proxy.

Bug 1341214 wants to make the use of taskcluster urls more generic, but can't rely on TASK_ID similarly because not all jobs where TASK_ID is set have access to the taskcluster proxy.
Comment on attachment 8840241 [details]
Bug 1341934 - Change how taskcluster urls are chosen between proxy and non-proxy.

https://reviewboard.mozilla.org/r/114716/#review116404

The taskgraph code only ever runs in the decision docker image, and in that circumstance it will never have TASKCLUSTER_PROXY set.  So this change will have the effect of always using the non-proxy URLs.  Which will be green, but loses the benefit of automatic retries.

The conditional on TASK_ID in docker_image.py and base.py allows the code to function successfully on a developer's system, when running `./mach taskgraph` by hand.  TASK_ID is set on all tasks (regardless of platform), but not on developers' systems.

The issue is that the decision task writes full URLs into various tasks (--installer-url, etc.), and those URLs are then interpreted on a totally different host.  So, to fix bug 1341214, you will need to differentiate the target platforms and only use the proxy version on platforms which support it (docker-worker only, at the moment).
Attachment #8840241 - Flags: review?(dustin) → review-
I'm going to address the core issue in bug 1341214, but I still think TASK_ID is the wrong variable to test for, and that this patch should land, albeit with an additional change that makes the decision graph task itself have TASKCLUSTER_PROXY set.
Please explain your aversion to TASK_ID.  It is set in *precisely* the situations where it needs to be -- set in decision tasks run within a task environment, and not set when developers are running ./mach taskgraph locally.
Comment on attachment 8840241 [details]
Bug 1341934 - Change how taskcluster urls are chosen between proxy and non-proxy.

https://reviewboard.mozilla.org/r/114716/#review116780

::: taskcluster/taskgraph/transforms/task.py:470
(Diff revision 2)
>          worker['env']['USE_SCCACHE'] = '1'
>      else:
>          worker['env']['SCCACHE_DISABLE'] = '1'
>  
> +    if features.get('taskclusterProxy'):
> +        worker['env']['TASKCLUSTER_PROXY'] = '1'

This environment variable is never used..
Attachment #8840241 - Flags: review?(dustin) → review-
(In reply to Dustin J. Mitchell [:dustin] from comment #6)
> This environment variable is never used..

... yet.

(In reply to Dustin J. Mitchell [:dustin] from comment #5)
> Please explain your aversion to TASK_ID.  It is set in *precisely* the
> situations where it needs to be -- set in decision tasks run within a task
> environment, and not set when developers are running ./mach taskgraph
> locally.

TASK_ID is set on all docker image jobs AIUI. Which may or may not have all access to the proxy. The only reason TASK_ID works currently, is that the code using that check runs on one particular docker image that does have it set *and* does have access to the proxy.

If some other code (non hypothetical, I'm working on such code) that creates taskcluster urls runs on docker images that *don't* have access to the proxy, there needs to be a different indicator than TASK_ID to tell them whether they can use the proxy or not.
I don't know why we'd schedule a docker image job without access to the proxy?  (And by "docker image job" I think you mean, a task that builds a docker image)

I'm having a hard time imaginging code that runs the taskgraph-generation code that runs in docker, but does not have access to the proxy.  Why not just add {feature: {taskclusterProxy: true}} to that task?

Maybe if you shared more about this hypothetical code, we could get to the heart of the issue?
No longer blocks: 1341214
Depends on: 1341214
I think there are a bunch of concerns that potentially get mixed together.

IMO, we shouldn't use taskcluster-proxy purely for the benefit of retries, since:

1) it is much cheaper to mange this in calling code (taskcluster-proxy requires an additional attached docker container to be managed and consuming resources)
2) if different code paths hit taskcluster-proxy or hit a service directly, the retry semantics should be equivalent. Therefore the retries should be implemented in the calling code, so you don't lose this benefit, just because you stop using the proxy. And you don't end up getting retries of retries, just because you retry in the hit-service-directly case.

As I see it, the only time to use taskcluster-proxy therefore should be when both the following conditions are met:

1) the endpoint being hit is scope protected (e.g. creating a task/uploading an artifact, but not downloading a task log file)
2) the environment from which the request is being made, does not have sufficient taskcluster credentials

Filtering out type 1) is a matter of looking at what calls are being made, and can be considered on a case-by-case basis based on the endpoint being hit. If the call doesn't require scopes, hit the taskcluster service directly, and don't use the proxy.

The only possibility for ruling out type 2 is to check if suitable taskcluster environment variables exist (TASKCLUSTER_CLIENT_ID, TASKCLUSTER_ACCESS_TOKEN, optionally TASKCLUSTER_CERTIFICATE). If they *aren't* set, we've already established that credentials are needed, so you have no choice but to try to use the proxy. If it is there, and it works, great. If it fails - well, you aren't any worse off for trying, because you couldn't authenticate directly anyway since you had no credentials - so you've lost nothing.

IMHO, it was probably a mistake implementing retries in taskcluster-client-proxy, since now calling code implementing retries potentially runs retries of retries. Hindsight is great! :-)

This long explanation was mostly to highlight that I think TASK_ID isn't necessarily so relevant, since it is more a question about whether you need to sign the requests or not, and if you do, do you have any credentials to sign them - if not - you have no choice but to try to use the proxy. In all other cases, if you can hit service directly, do that.
Note, since the main use case of calling a taskcluster service from a task is submitting tasks in a decision task, I still feel it would be better to accommodate this in the task payload, such that the task can say "new tasks will be written to this folder" and when the task execution completes, the worker can read all the task definitions in that folder (for example, a directory of yaml files), check that they are valid, and contain no circular dependencies, then sequence them in an appropriate order to submit to the queue, and then submit them.

This is similar to how docker worker used to extend the task graph by submitting a taskgraph to the queue, but different in the following ways:

1) it will submit to queue, not to (the now deprecated) scheduler
2) it will take care of tracking dependencies between tasks, to avoid submitting a task that depends on another task that has not yet been submitted

This has the following advantages:

1) life is considerably simpler for devs, they do not need to worry that they test a decision task, and real tasks get submitted
2) documentation and ./mach targets are the same in testing and production (we don't need different endpoints that actually submit the tasks versus write them to files)
3) developers need not concern themselves with the mechanics of submitting tasks (e.g. sequencing, retries, determining taskcluster-proxy url to call)
4) we don't need any code to distinguish between whether to use proxy, or queue directly (in the case of submitting tasks)
5) we don't need to have an extra docker container attached to the task container
6) we have structured metadata which is inspectible by tools, to know which tasks a task submitted. note, if turing-complete code submits tasks, we can't so easily reason about it. note though, we could probably reverse-engineer this by intercepting the calls on taskcluster-proxy, but that is ugly.
7) learning curve is consierably less for task authors
8) workers won't require wget/curl/... tools installed, that may operate peculiarly or unexpectedly depending on platform, version, etc
9) we can optimise parallelising task submission requests, take care of retry mechanics etc in a consistent manner

Disadvantages:

1) we have a new feature to implement
2) tasks are only submitted at task completion, not during task execution (and nothing stops a task author from manually submitting to queue if really wanted)

To my mind, the advantages significantly outweigh the disadvantages.
So, I thought I could get away with this bug not being fixed, but in the end, I can't.

For multiple reasons, for bug 1313111, we'll be using a mach command on most build jobs to pull things from tooltool and/or taskcluster artifacts as appropriate. Part of this involves running the taskcluster code, and part of this involves running the optimize_task function, which calls opt_index_search, which does this:

     find_task_id(index_path, use_proxy=bool(os.environ.get('TASK_ID')))

Guess what? Windows taskcluster builders have TASK_ID set but don't have access to the taskcluster proxy. So we're back to square one, where TASK_ID is not the right test for this.
Blocks: 1313111
Comment on attachment 8840241 [details]
Bug 1341934 - Change how taskcluster urls are chosen between proxy and non-proxy.

https://reviewboard.mozilla.org/r/114716/#review133430
Attachment #8840241 - Flags: review?(dustin) → review+
cf. the discussion around bug 1356529, we're not going to need this.
Status: NEW → RESOLVED
Closed: 3 years ago
Resolution: --- → WONTFIX
Product: TaskCluster → Firefox Build System
You need to log in before you can comment on or make changes to this bug.