Closed Bug 1501720 (opened 6 years ago, closed 6 years ago)

Inscrutable error building Docker image: "Get http:///var/run/docker.sock/v1.18/images/taskcluster-built/get: EOF. Are you trying to connect to a TLS-enabled daemon without TLS?"

Categories

(Taskcluster :: Workers, defect)

Type: defect
Priority: Not set
Severity: normal

Tracking

(Not tracked)

Status: RESOLVED WONTFIX

People

(Reporter: nalexander, Assigned: wcosta)

References

Details

This has reproduced, and the following task is representative: https://tools.taskcluster.net/groups/UVLGdXn4T5OQfFYzkxDINQ/tasks/P9fWCKaqQuOAtJsHgMoqMg/details

I see:

+ docker save taskcluster-built
time="2018-10-24T02:06:27Z" level=fatal msg="Get http:///var/run/docker.sock/v1.18/images/taskcluster-built/get: EOF. Are you trying to connect to a TLS-enabled daemon without TLS?"
[taskcluster 2018-10-24 02:06:29.020Z] === Task Finished ===
[taskcluster 2018-10-24 02:06:29.894Z] Successful task run with exit code: 0 completed in 718.31 seconds

Bug 1389035 suggests this is a worker running out of disk space. Could I get confirmation (by inspecting logs in the way that was done on the earlier ticket)?

If this is in fact a disk space issue, let's morph this ticket into addressing some of the papercuts here. I see:

a) There is no way to determine what the space limits actually are. My image is very large (8-10 GB), but it's that way because it has lots of stuff in it, to avoid bootstrapping at build time. If we need to change, how do I determine how much to change?

b) That task functionally failed. How do we make it actually fail? I can't really tell what level the error is at. Is it in the image-building task, and somehow `docker save taskcluster-built` isn't surfacing an error code?

c) Can we notice when Docker images are 0 bytes and mark those as failures, rather than waiting for a task to consume them and immediately fail?
I don't see any logging about being out of space. That runs on m3.xlarge, which has 80 GB of disk, so 8-10 GB should be fine. The execution of the linked task overlapped slightly with https://tools.taskcluster.net/groups/bb2mWib1Rbq5JBKlMX93Aw/tasks/bb2mWib1Rbq5JBKlMX93Aw/runs/0 and that, being a Java build, probably used a good bit of space, but it's hard to see it being more than 60 GB!

A bit of Stack Overflow searching suggests that error occurs when the task can't talk to the docker daemon, or hits some other low-level error. I don't see any syslog errors from docker during that time frame.

That it exits with status 0 is kind of awful, but I suspect it's a docker bug -- your shell invocation of it looks correct. It might be necessary to monitor its output and look for "fatal" or something like that.

Regarding (c), the worker doesn't know that the produced artifact is an image, so it's not in a great spot to make that distinction. However, the script could probably check with a quick shell conditional.

How reproducible is this? Every time? Often? If it's contention with other tasks, we could certainly set up a dedicated workerType for this project which wouldn't have to share with anything.

Sorry I don't have any great data for you :(
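To make the suggestion above concrete, here is a minimal sketch of such a shell conditional, assuming the image task saves the image with something like `docker save` redirected to a file. The script, output file, and log file names are hypothetical, not the ones used by the real image-build task:

```sh
#!/bin/sh
# Hypothetical post-save check; "taskcluster-built" comes from this bug,
# but image.tar and save.log are made-up names for illustration.
set -eu

docker save taskcluster-built > image.tar 2> save.log

# The client in this bug printed a fatal error yet exited 0, so check its
# output as well as its exit status.
if grep -qi 'level=fatal' save.log; then
  echo "docker save reported a fatal error:" >&2
  cat save.log >&2
  exit 1
fi

# A zero-byte tarball means the save did not actually produce an image.
if [ ! -s image.tar ]; then
  echo "image.tar is empty; failing the task" >&2
  exit 1
fi
```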
I suspect it is a bug in docker: the version used by dind is too old.
Assignee: nobody → wcosta
Status: NEW → ASSIGNED
(In reply to Wander Lairson Costa [:wcosta] from comment #2)
> I suspect it is a bug in docker: the version used by dind is too old.

OK. I tried to push a Dockerfile that would produce a smaller image (https://tools.taskcluster.net/groups/Jpc7jOTJTpS0FK2aPWP48A/tasks/VUtaSLWMSiuZ9WMGvQFlMQ/runs/0/logs/public%2Flogs%2Flive.log) and got the same error.

I see you assigned yourself. Is the plan for you to update dind? Can you explain why I am seeing this and Servo is not? I cloned and hacked up Servo's approach to this, namely https://github.com/servo/servo/blob/master/.taskcluster.yml and https://github.com/servo/servo/blob/master/etc/taskcluster/decisionlib.py. To the best of my knowledge nothing I changed would impact dind or Docker itself.
Flags: needinfo?(wcosta)
(In reply to Nick Alexander :nalexander [he/him] from comment #3)
> I see you assigned yourself. Is the plan for you to update dind? Can you
> explain why I am seeing this and Servo is not?

I have a PR to update docker in the dind service: https://github.com/taskcluster/dind-service/pull/5
Flags: needinfo?(wcosta)
(In reply to Wander Lairson Costa [:wcosta] from comment #4)
> I have a PR to update docker in the dind service:
> https://github.com/taskcluster/dind-service/pull/5

This is great! I see that PR is merged. Can you update this ticket with the status? When does that merge make it to production, so that I can test and we can try to close this ticket?
Flags: needinfo?(wcosta)
(In reply to Nick Alexander :nalexander [he/him] from comment #5)
> This is great! I see that PR is merged. Can you update this ticket with the
> status? When does that merge make it to production, so that I can test and
> we can try to close this ticket?

During integration I found an issue; I am preparing a new patch.
Flags: needinfo?(wcosta)
I made a bunch of changes to the underlying TC integration (which I just landed), so I retried this to see if anything changed. I now have a task group in a _very_ strange state: https://tools.taskcluster.net/groups/MQoMhW3sS32JEBYfYCMB4Q

It says the image task is still running, but half the system thinks it has finished (there's an artifact). There's no docker error string, but I have a 0 byte image again. Something is obviously still wrong -- maybe the same thing -- but now TC seems to be doing a worse job handling the situation.
I don't think anything's changed in TC -- dind never got merged. Here are the logs from that instance:

Oct 30 23:27:46 docker-worker.aws-provisioner.us-west-2b.ami-6f401817.c5-9xlarge.i-0214eb1b81d810c7b docker-worker: {"type":"claimed task","source":"top","provisionerId":"aws-provisioner-v1","workerId":"i-0214eb1b81d810c7b","workerGroup":"us-west-2","workerType":"application-services-r","workerNodeType":"c5.9xlarge","taskId":"TYdlu2feSOy-p66j1h09Kw","runId":0}
Oct 30 23:27:46 docker-worker.aws-provisioner.us-west-2b.ami-6f401817.c5-9xlarge.i-0214eb1b81d810c7b docker-worker: {"type":"not superseding","source":"top","provisionerId":"aws-provisioner-v1","workerId":"i-0214eb1b81d810c7b","workerGroup":"us-west-2","workerType":"application-services-r","workerNodeType":"c5.9xlarge","taskId":"TYdlu2feSOy-p66j1h09Kw","message":"no supersederUrl in payload"}
Oct 30 23:27:46 docker-worker.aws-provisioner.us-west-2b.ami-6f401817.c5-9xlarge.i-0214eb1b81d810c7b docker-worker: {"type":"run task","source":"top","provisionerId":"aws-provisioner-v1","workerId":"i-0214eb1b81d810c7b","workerGroup":"us-west-2","workerType":"application-services-r","workerNodeType":"c5.9xlarge","taskId":"TYdlu2feSOy-p66j1h09Kw","runId":0}
Oct 30 23:27:46 docker-worker.aws-provisioner.us-west-2b.ami-6f401817.c5-9xlarge.i-0214eb1b81d810c7b docker-worker: {"type":"task start","source":"top","provisionerId":"aws-provisioner-v1","workerId":"i-0214eb1b81d810c7b","workerGroup":"us-west-2","workerType":"application-services-r","workerNodeType":"c5.9xlarge","taskId":"TYdlu2feSOy-p66j1h09Kw","runId":0}
Oct 30 23:27:46 docker-worker.aws-provisioner.us-west-2b.ami-6f401817.c5-9xlarge.i-0214eb1b81d810c7b docker-worker: {"type":"next claim","source":"top","provisionerId":"aws-provisioner-v1","workerId":"i-0214eb1b81d810c7b","workerGroup":"us-west-2","workerType":"application-services-r","workerNodeType":"c5.9xlarge","time":923130.7692307692,"primaryTaskId":"TYdlu2feSOy-p66j1h09Kw","primaryRunId":0,"taskId":"TYdlu2feSOy-p66j1h09Kw","runId":0,"takenUntil":"2018-10-30T23:47:46.402Z"}
Oct 30 23:39:45 docker-worker.aws-provisioner.us-west-2b.ami-6f401817.c5-9xlarge.i-0214eb1b81d810c7b docker-worker: {"type":"Uploading public/image.tar.lz4","source":"uploadToS3","taskId":"TYdlu2feSOy-p66j1h09Kw","runId":0,"artifactName":"public/image.tar.lz4"}
Oct 30 23:43:10 docker-worker.aws-provisioner.us-west-2b.ami-6f401817.c5-9xlarge.i-0214eb1b81d810c7b docker-worker: {"type":"reclaiming task","source":"top","provisionerId":"aws-provisioner-v1","workerId":"i-0214eb1b81d810c7b","workerGroup":"us-west-2","workerType":"application-services-r","workerNodeType":"c5.9xlarge","primaryTaskId":"TYdlu2feSOy-p66j1h09Kw","primaryRunId":0,"taskId":"TYdlu2feSOy-p66j1h09Kw","runId":0,"takenUntil":"2018-10-30T23:47:46.402Z"}
Oct 30 23:43:10 docker-worker.aws-provisioner.us-west-2b.ami-6f401817.c5-9xlarge.i-0214eb1b81d810c7b docker-worker: {"type":"reclaimed task","source":"top","provisionerId":"aws-provisioner-v1","workerId":"i-0214eb1b81d810c7b","workerGroup":"us-west-2","workerType":"application-services-r","workerNodeType":"c5.9xlarge","primaryTaskId":"TYdlu2feSOy-p66j1h09Kw","primaryRunId":0,"taskId":"TYdlu2feSOy-p66j1h09Kw","runId":0,"takenUntil":"2018-10-31T00:03:10.413Z"}
Oct 30 23:43:10 docker-worker.aws-provisioner.us-west-2b.ami-6f401817.c5-9xlarge.i-0214eb1b81d810c7b docker-worker: {"type":"next claim","source":"top","provisionerId":"aws-provisioner-v1","workerId":"i-0214eb1b81d810c7b","workerGroup":"us-west-2","workerType":"application-services-r","workerNodeType":"c5.9xlarge","time":923360,"primaryTaskId":"TYdlu2feSOy-p66j1h09Kw","primaryRunId":0,"taskId":"TYdlu2feSOy-p66j1h09Kw","runId":0,"takenUntil":"2018-10-31T00:03:10.413Z"}
Oct 30 23:58:34 docker-worker.aws-provisioner.us-west-2b.ami-6f401817.c5-9xlarge.i-0214eb1b81d810c7b docker-worker: {"type":"reclaiming task","source":"top","provisionerId":"aws-provisioner-v1","workerId":"i-0214eb1b81d810c7b","workerGroup":"us-west-2","workerType":"application-services-r","workerNodeType":"c5.9xlarge","primaryTaskId":"TYdlu2feSOy-p66j1h09Kw","primaryRunId":0,"taskId":"TYdlu2feSOy-p66j1h09Kw","runId":0,"takenUntil":"2018-10-31T00:03:10.413Z"}

Wander, any idea what's going on here?
Depends on: 1507954
Curious: a job with a very minimal Dockerfile succeeded: https://tools.taskcluster.net/groups/edu6XhdBTg62NCtkIjV5Ug/tasks/fYj0wUw3SGqfOCrNW3O42A/details

That suggests something is happening with the very large images that a-s is using. Perhaps this is as simple as lz4 running out of memory?
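For what it's worth, if the image artifact is produced by piping `docker save` through `lz4` (an assumption; the exact pipeline isn't shown in this bug), then in bash a failure in the `docker save` stage is masked unless the shell is told to propagate pipeline failures, which could also explain a tiny or zero-byte artifact from a task that "succeeds". A hypothetical sketch, with illustrative command and file names:

```sh
# Without pipefail, the pipeline's exit status is lz4's alone, so a failed
# `docker save` can still produce a truncated image.tar.lz4 and report success.
docker save taskcluster-built | lz4 > image.tar.lz4

# With pipefail plus a size sanity check, either a docker save failure or an
# lz4 crash (e.g. killed by the OOM killer) surfaces as a task failure:
set -o pipefail
docker save taskcluster-built | lz4 > image.tar.lz4 || exit 1
[ -s image.tar.lz4 ] || { echo "image artifact is empty" >&2; exit 1; }
```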

Wander: what's our plan here? Is this important and/or common enough to bother fixing now, or should we WONTFIX in favour of a quicker move to generic-worker?

Flags: needinfo?(wcosta)

(In reply to Chris Cooper [:coop] pronoun: he from comment #10)

> Wander: what's our plan here? Is this important and/or common enough to
> bother fixing now, or should we WONTFIX in favour of a quicker move to
> generic-worker?

I concluded that this was strictly based on image size, so if we can just keep the "your images need to be small" message high in the SEO rankings, that's enough for this.

(In reply to Chris Cooper [:coop] pronoun: he from comment #10)

> Wander: what's our plan here? Is this important and/or common enough to
> bother fixing now, or should we WONTFIX in favour of a quicker move to
> generic-worker?

Upgrading dind wasn't possible because newer versions of dind work very differently. The solution is to set up a host machine with generic-worker, I guess.

Flags: needinfo?(wcosta)
Component: Docker-Worker → Workers
Status: ASSIGNED → RESOLVED
Closed: 6 years ago
Resolution: --- → WONTFIX
See Also: → 1587611