In this push, we had a failure: https://treeherder.mozilla.org/#/jobs?repo=try&revision=def46785a370d60636ac6218a95aa598f90c34ab&selectedJob=114303644

[taskcluster 2017-07-14 10:28:58.060Z] Task ID: WCMGThvHQ6W3kk6wU2Ap2Q
[taskcluster 2017-07-14 10:28:58.061Z] Worker ID: i-0b604f3218594f148
[taskcluster 2017-07-14 10:28:58.061Z] Worker Group: us-east-1
[taskcluster 2017-07-14 10:28:58.061Z] Worker Node Type: c4.4xlarge
[taskcluster 2017-07-14 10:28:58.061Z] Worker Type: gecko-1-b-linux
[taskcluster 2017-07-14 10:28:58.061Z] Public IP: 184.108.40.206
<snip/>
./mach artifact toolchain -v --tooltool-url=http://relengapi/tooltool/ --tooltool-manifest browser/config/tooltool-manifests/linux64/clang.manifest --cache-dir /home/worker/tooltool-cache --retry 5
<snip/>
NewConnectionError('<requests.packages.urllib3.connection.HTTPConnection object at 0x7f23b810d410>: Failed to establish a new connection: [Errno -2] Name or service not known',)

The task was retriggered, and it ran without problems on a different worker in the same region, on the same instance type:

[taskcluster 2017-07-14 10:46:09.628Z] Task ID: Xq5NtMfySLujXZIHWRXgxQ
[taskcluster 2017-07-14 10:46:09.629Z] Worker ID: i-0566418cc92c90704
[taskcluster 2017-07-14 10:46:09.629Z] Worker Group: us-east-1
[taskcluster 2017-07-14 10:46:09.629Z] Worker Node Type: c4.4xlarge
[taskcluster 2017-07-14 10:46:09.629Z] Worker Type: gecko-1-b-linux
[taskcluster 2017-07-14 10:46:09.629Z] Public IP: 220.127.116.11
<snip/>
./mach artifact toolchain -v --tooltool-url=http://relengapi/tooltool/ --tooltool-manifest browser/config/tooltool-manifests/linux64/clang.manifest --cache-dir /home/worker/tooltool-cache --retry 5
<snip/>
Downloaded artifact to /home/worker/tooltool-cache/52f3fc23f0f5c98050f8b0ac7c92a6752d067582a16f712a5a58074be98975d594f9e36249fc2be7f1cc2ca6d509c663faaf2bea66f949243cc1f41651638ba6

So the question is: why can the 'relengapi' host be resolved on i-0566418cc92c90704 but not on i-0b604f3218594f148?
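For what it's worth, the NewConnectionError above is just requests/urllib3 wrapping a failed DNS lookup. A minimal way to check whether a name resolves from inside a task, without the HTTP layer in the way (a sketch; `can_resolve` is a hypothetical helper, not part of mach):

```python
import socket

def can_resolve(hostname):
    """Return True if this host can resolve the given name.

    A failed lookup raises socket.gaierror, the same underlying
    "[Errno -2] Name or service not known" failure that urllib3
    surfaces as NewConnectionError.
    """
    try:
        socket.gethostbyname(hostname)
        return True
    except socket.gaierror:
        return False
```

Running this on the two workers would confirm whether it really is a resolver difference rather than anything HTTP-specific.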
Note: in the logs above I've only shown one failure, but all retries failed with the same message.
Interesting. Neither task has features.relengapi = true, so *neither* should have been able to use the proxy. The bug is that a task can sometimes use relengapiProxy even when the feature is not specified in its task definition.
Nice spot! So maybe the bug is that once a worker has run a task with the feature enabled, it remains enabled for subsequent tasks? We'll have to look into the code: https://github.com/taskcluster/docker-worker
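The suspected bug class would look something like this (a hypothetical illustration only - the actual docker-worker code is JavaScript, and this is not taken from it):

```python
class Worker:
    """Hypothetical worker that keeps a feature flag as worker-level
    state instead of per-task state."""

    def __init__(self):
        # Worker-level flag: survives across tasks on the same worker.
        self.relengapi_proxy_enabled = False

    def run_task(self, task):
        # Bug pattern: the flag is switched on when a task requests
        # the feature...
        if task.get("features", {}).get("relengapi"):
            self.relengapi_proxy_enabled = True
        # ...but never cleared, so later tasks on the same worker
        # inherit the proxy even without requesting it.
        return self.relengapi_proxy_enabled

w = Worker()
w.run_task({"features": {"relengapi": True}})  # proxy enabled
w.run_task({})  # leaked state: proxy still enabled for this task
```

If that were the mechanism, the first task on a fresh worker without the feature would fail to resolve `relengapi`, while a reused worker that had previously run a feature-enabled task would succeed - consistent with the symptoms above.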
I tried a bunch of times to run a task copied from the second (Xq5N..) task above, modifying only the command to run `host relengapi`: https://tools.taskcluster.net/groups/HUalP7SpT3yQd7-t1ny7tg/tasks/HUalP7SpT3yQd7-t1ny7tg/details

Those tasks all ran immediately, so they were running on previously-used workers, yet all of them came up with "Host relengapi not found".

Glandium has also noticed this `./mach artifact toolchain` run working successfully on Windows, which is especially weird since relengapiProxy is a docker container and is completely unsupported on Windows. Pete verified that `relengapi` doesn't resolve to anything on Windows: https://tools.taskcluster.net/groups/ZDEJSCG4Q-aJJ7vbx-TliQ/tasks/ZDEJSCG4Q-aJJ7vbx-TliQ/runs/0/logs/public%2Flogs%2Flive.log

So, I have no idea what's going on here. The log says "Downloaded .. to cache ..", which suggests that it didn't just happen to find the file in the tooltool cache.
Ah! The code in python/mozbuild/mozbuild/artifacts.py shows that message even when the file came from the cache. If the file had actually been downloaded, the logs would also contain download-progress output. So what's happening here is that the files are found in the cache, having been downloaded by a previous task, and `relengapi` is never accessed. The same applies on Windows, where those earlier downloads were done via the public relengapi interface. Nice to find a solution that doesn't involve leaking docker containers :)
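To spell out the misleading-log pattern (a simplified sketch, not the real artifacts.py; `fetch_artifact` is a hypothetical stand-in):

```python
import os
import tempfile
import urllib.request

def fetch_artifact(url, digest, cache_dir):
    """Cache-first fetch that logs the same message on a cache hit
    as on a real download - mirroring the ambiguity seen above."""
    path = os.path.join(cache_dir, digest)
    if not os.path.exists(path):
        # Only this branch touches the network (and would print
        # download-progress output in the real code).
        urllib.request.urlretrieve(url, path)
    # Printed in both cases, so a cache hit looks like a download.
    print("Downloaded artifact to %s" % path)
    return path
```

On a cache hit, the URL is never contacted at all, which is why the tasks succeed even on workers where `relengapi` doesn't resolve.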
Thanks dustin. That explains a lot! We should probably not use any global caches on the Windows workers - I think they were added before the generic-worker caches feature existed. By global caches, I mean folders that are readable (and possibly writable) by all task users at a fixed (known) location on the file system. Instead we should use the worker's caches feature, which makes this more explicit. I'll create a bug for this...
(and not only explicit, also scope-protected)
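For reference, an explicit, scope-protected cache in a generic-worker task payload might look roughly like this (sketched as a Python dict; the cache name and directory here are illustrative, not taken from an actual task):

```python
# Hypothetical generic-worker task payload fragment: the cache is
# declared explicitly via mounts rather than living at a fixed,
# well-known path writable by every task user.
task_payload = {
    "mounts": [
        {
            "cacheName": "tooltool-cache",   # illustrative name
            "directory": "tooltool-cache",   # relative to the task dir
        }
    ],
}

# The task also needs a matching scope before the worker will mount
# the cache, which is what makes it scope-protected:
required_scope = "generic-worker:cache:tooltool-cache"
```

That way a task can only see cached artifacts if it explicitly declares the cache and holds the corresponding scope.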