Closed Bug 1381000 Opened 7 years ago Closed 7 years ago

docker-worker: relengapi is sometimes available when the feature is not enabled

Categories

(Taskcluster :: Workers, defect)

defect
Not set
normal

Tracking

(Not tracked)

RESOLVED INVALID

People

(Reporter: pmoore, Unassigned)

Details

In this push, we had a failure:

https://treeherder.mozilla.org/#/jobs?repo=try&revision=def46785a370d60636ac6218a95aa598f90c34ab&selectedJob=114303644


[taskcluster 2017-07-14 10:28:58.060Z] Task ID: WCMGThvHQ6W3kk6wU2Ap2Q
[taskcluster 2017-07-14 10:28:58.061Z] Worker ID: i-0b604f3218594f148
[taskcluster 2017-07-14 10:28:58.061Z] Worker Group: us-east-1
[taskcluster 2017-07-14 10:28:58.061Z] Worker Node Type: c4.4xlarge
[taskcluster 2017-07-14 10:28:58.061Z] Worker Type: gecko-1-b-linux
[taskcluster 2017-07-14 10:28:58.061Z] Public IP: 54.175.214.35
<snip/>
./mach artifact toolchain -v --tooltool-url=http://relengapi/tooltool/ --tooltool-manifest browser/config/tooltool-manifests/linux64/clang.manifest --cache-dir /home/worker/tooltool-cache --retry 5
<snip/>
NewConnectionError('<requests.packages.urllib3.connection.HTTPConnection object at 0x7f23b810d410>: Failed to establish a new connection: [Errno -2] Name or service not known',)




The task was retriggered, and it ran without problems on a different worker in the same region, on the same instance type:


[taskcluster 2017-07-14 10:46:09.628Z] Task ID: Xq5NtMfySLujXZIHWRXgxQ
[taskcluster 2017-07-14 10:46:09.629Z] Worker ID: i-0566418cc92c90704
[taskcluster 2017-07-14 10:46:09.629Z] Worker Group: us-east-1
[taskcluster 2017-07-14 10:46:09.629Z] Worker Node Type: c4.4xlarge
[taskcluster 2017-07-14 10:46:09.629Z] Worker Type: gecko-1-b-linux
[taskcluster 2017-07-14 10:46:09.629Z] Public IP: 34.205.65.140
<snip/>
./mach artifact toolchain -v --tooltool-url=http://relengapi/tooltool/ --tooltool-manifest browser/config/tooltool-manifests/linux64/clang.manifest --cache-dir /home/worker/tooltool-cache --retry 5
<snip/>
Downloaded artifact to /home/worker/tooltool-cache/52f3fc23f0f5c98050f8b0ac7c92a6752d067582a16f712a5a58074be98975d594f9e36249fc2be7f1cc2ca6d509c663faaf2bea66f949243cc1f41651638ba6


So the question is: why can the 'relengapi' host not be resolved on i-0b604f3218594f148 when it can be resolved on i-0566418cc92c90704?
Note that the logs above show only one failure, but all retries failed too, with the same message.
Interesting.  Neither task has features.relengapi = true, so *neither* should have been able to use the proxy.  The bug is that sometimes you can use relengapiProxy when the feature is not specified in the task definition.
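A minimal sketch of the check described above: inspect a task definition and report whether a feature is explicitly enabled. The payload layout (`payload.features`) and the key name `relengAPIProxy` are assumptions for illustration; this thread only refers to it as `features.relengapi` and `relengapiProxy`, so check the exact spelling against the worker's payload schema.

```python
def feature_enabled(task, feature):
    """Return True only if the task payload explicitly sets `feature` to true.

    The `payload.features` layout is an assumption for this sketch.
    """
    features = task.get("payload", {}).get("features", {})
    return features.get(feature) is True


# Hypothetical task definitions, trimmed to the relevant fields.
task_without_feature = {"payload": {"command": ["host", "relengapi"]}}
task_with_feature = {"payload": {"features": {"relengAPIProxy": True}}}

print(feature_enabled(task_without_feature, "relengAPIProxy"))
print(feature_enabled(task_with_feature, "relengAPIProxy"))
```

A task for which this returns False should not be able to reach the proxy at all, which is why a successful run looked like a worker bug.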
Summary: docker-worker: Host relengapi not resolved on some workers of a worker type where other workers resolve it ok → docker-worker: relengapi is sometimes available when the feature is not enabled
Nice spot! So maybe the bug is something like, once a worker has run a task with the feature enabled, it remains enabled for future tasks? We'll have to look into the code ....

https://github.com/taskcluster/docker-worker
I tried a bunch of times to run a task copied from the second (Xq5N..) task above, with only the command changed to `host relengapi`:
  https://tools.taskcluster.net/groups/HUalP7SpT3yQd7-t1ny7tg/tasks/HUalP7SpT3yQd7-t1ny7tg/details
Those tasks all ran immediately, so they were running on previously-used workers, yet all of them came up with "Host relengapi not found".

Glandium has also noticed this `./mach artifact toolchain` run working successfully on Windows, which is especially weird since relengapiProxy is a docker container and is completely unsupported on Windows.  Pete verified that `relengapi` doesn't resolve to anything on Windows:
  https://tools.taskcluster.net/groups/ZDEJSCG4Q-aJJ7vbx-TliQ/tasks/ZDEJSCG4Q-aJJ7vbx-TliQ/runs/0/logs/public%2Flogs%2Flive.log

So, I have no idea what's going on here.  The log says "Downloaded .. to cache ..", which suggests that it didn't just happen to find the file in the tooltool cache.
Ah!  The code in python/mozbuild/mozbuild/artifacts.py shows that message appearing even when the file was cached.  If it had been downloaded, the logs would also contain some download-progress stuff.  So what's happening here is that the files are found in the cache, having been downloaded by a previous task, and `relengapi` is never accessed.
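The behaviour can be illustrated with a small cache-first fetch. This is a sketch under stated assumptions, not the actual code in artifacts.py: `fetch_to_cache` and both download callables are hypothetical names, and the only detail taken from the logs is that the same "Downloaded artifact to ..." message is printed whether or not the network was used.

```python
import os
import tempfile


def fetch_to_cache(digest, cache_dir, download):
    """Return the cached path for `digest`, downloading only on a cache miss.

    `download` is a hypothetical callable standing in for the network fetch.
    """
    path = os.path.join(cache_dir, digest)
    if not os.path.exists(path):
        # Cache miss: the only branch that touches the network.
        data = download(digest)
        with open(path, "wb") as f:
            f.write(data)
    # Logged on a hit and on a miss alike, so this line alone does not
    # prove that a download took place.
    print("Downloaded artifact to %s" % path)
    return path


def working_download(digest):
    # Stands in for an earlier task that could reach the server.
    return b"toolchain bytes"


def unresolvable_download(digest):
    # Stands in for a task where the host name cannot be resolved.
    raise OSError("Name or service not known")


cache_dir = tempfile.mkdtemp()
first = fetch_to_cache("abc123", cache_dir, working_download)
# Same cache directory, but no working network: still succeeds.
second = fetch_to_cache("abc123", cache_dir, unresolvable_download)
```

With a warm cache the second call never invokes its download callable, which matches the successful task; with an empty cache the same call would raise, which matches the failing one.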

On Windows, the same thing is happening: the files were already cached by earlier downloads, which had been done with the public relengapi interface.

Nice to find a solution that doesn't involve leaking docker containers :)
Status: NEW → RESOLVED
Closed: 7 years ago
Resolution: --- → INVALID
Thanks dustin. That explains a lot!

We should probably not use any global caches on the Windows workers - I think they were added before the generic worker caches feature existed. By global caches, I mean folders that are readable (and possibly writable) by all task users at a fixed (known) location on the file system. Instead we should use the caches feature of the worker, which makes this more explicit. I'll create a bug for this...
(and not only explicit, also scope-protected)
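A rough sketch of what "explicit and scope-protected" could look like from the task's side. The payload shape (a `mounts` list with `cacheName` and `directory`) and the scope prefix `generic-worker:cache:` are assumptions made here for illustration, as is the cache name `tooltool-cache`; check the generic-worker payload documentation for the real schema.

```python
def cache_scopes(payload, prefix="generic-worker:cache:"):
    """List the scopes a task would need for the caches it mounts.

    Both the `mounts` layout and the scope prefix are assumptions.
    """
    mounts = payload.get("mounts", [])
    return sorted(prefix + m["cacheName"] for m in mounts if "cacheName" in m)


# Hypothetical payload: the cache is named in the task definition instead of
# living at a fixed path that every task user can read.
payload = {
    "mounts": [
        {"cacheName": "tooltool-cache", "directory": "tooltool-cache"},
    ],
}

required = cache_scopes(payload)
print(required)
```

Because the cache is declared per task, a task that does not list it (or lacks the scope) would not see files left behind by other tasks.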
Component: Docker-Worker → Workers