Closed Bug 1501720 (opened 6 years ago, closed 6 years ago)

Inscrutable error building Docker image: "Get http:///var/run/docker.sock/v1.18/images/taskcluster-built/get: EOF. Are you trying to connect to a TLS-enabled daemon without TLS?"

Categories

(Taskcluster :: Workers, defect)

Type: defect
Priority: Not set
Severity: normal

Tracking

(Not tracked)

Status: RESOLVED WONTFIX

People

(Reporter: nalexander, Assigned: wcosta)

References

Details

This has reproduced, and the following task is representative: https://tools.taskcluster.net/groups/UVLGdXn4T5OQfFYzkxDINQ/tasks/P9fWCKaqQuOAtJsHgMoqMg/details

I see:

+ docker save taskcluster-built
time="2018-10-24T02:06:27Z" level=fatal msg="Get http:///var/run/docker.sock/v1.18/images/taskcluster-built/get: EOF. Are you trying to connect to a TLS-enabled daemon without TLS?"
[taskcluster 2018-10-24 02:06:29.020Z] === Task Finished ===
[taskcluster 2018-10-24 02:06:29.894Z] Successful task run with exit code: 0 completed in 718.31 seconds

Bug 1389035 suggests this is a worker running out of disk space. Could I get confirmation (by inspecting logs in the way that was done on the earlier ticket)?

If this is in fact a disk space issue, let's morph this ticket into addressing some of the papercuts here. I see:

a) There is no way to determine what the space limits actually are. My image is very large (8-10 GB), but it's that way because it has lots of stuff in it, to avoid bootstrapping at build time. If we need to change, how do I determine how much to change?

b) That task functionally failed. How do we make it actually fail? I can't really tell what level the error is at. Is it in the image-building task, and somehow `docker save taskcluster-built` isn't surfacing an error code?

c) Can we notice when Docker images are 0 bytes and mark those as failures, rather than waiting for a task to consume them and immediately fail?
I don't see any logging about being out of space. That runs on m3.xlarge, which has 80 GB of disk, so 8-10 GB should be fine. The execution of the linked task overlapped slightly with https://tools.taskcluster.net/groups/bb2mWib1Rbq5JBKlMX93Aw/tasks/bb2mWib1Rbq5JBKlMX93Aw/runs/0 and that, being a Java build, probably used a good bit of space, but it's hard to see it being more than 60 GB!

A bit of Stack Overflow searching suggests that error occurs when the task can't talk to the docker daemon, or hits some other low-level error. I don't see any syslog errors from docker during that time frame.

That it exits with status 0 is kind of awful, but I suspect it's a docker bug -- your shell invocation of it looks correct. It might be necessary to monitor its output and look for "fatal" or something like that.

Regarding (c), the worker doesn't know that the produced artifact is an image, so it's not in a great spot to make that distinction. However, the script could probably check with a quick shell conditional.

How reproducible is this? Every time? Often? If it's contention with other tasks, we could certainly set up a dedicated workerType for this project which wouldn't have to share with anything.

Sorry I don't have any great data for you :(
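To make the suggestion above concrete, here is a minimal sketch of such a shell conditional, assuming the image task saves the image with something like `docker save` redirected to a file. The script, output file, and log file names are hypothetical, not the ones used by the real image-build task:

```sh
#!/bin/sh
# Hypothetical post-save check; "taskcluster-built" comes from this bug,
# but image.tar and save.log are made-up names for illustration.
set -eu

docker save taskcluster-built > image.tar 2> save.log

# The client in this bug printed a fatal error yet exited 0, so check its
# output as well as its exit status.
if grep -qi 'level=fatal' save.log; then
  echo "docker save reported a fatal error:" >&2
  cat save.log >&2
  exit 1
fi

# A zero-byte tarball means the save did not actually produce an image.
if [ ! -s image.tar ]; then
  echo "image.tar is empty; failing the task" >&2
  exit 1
fi
```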
I suspect it is a bug in docker: the version used by dind is too old.
Assignee: nobody → wcosta
Status: NEW → ASSIGNED
(In reply to Wander Lairson Costa [:wcosta] from comment #2)
> I suspect it is a bug in docker: the version used by dind is too old.

OK. I tried to push a Dockerfile that would produce a smaller image (https://tools.taskcluster.net/groups/Jpc7jOTJTpS0FK2aPWP48A/tasks/VUtaSLWMSiuZ9WMGvQFlMQ/runs/0/logs/public%2Flogs%2Flive.log) and got the same error.

I see you assigned yourself. Is the plan for you to update dind? Can you explain why I am seeing this and Servo is not? I cloned and hacked up Servo's approach to this, namely https://github.com/servo/servo/blob/master/.taskcluster.yml and https://github.com/servo/servo/blob/master/etc/taskcluster/decisionlib.py. To the best of my knowledge nothing I changed would impact dind or Docker itself.
Flags: needinfo?(wcosta)
(In reply to Nick Alexander :nalexander [he/him] from comment #3)
> I see you assigned yourself. Is the plan for you to update dind? Can you
> explain why I am seeing this and Servo is not?

I have a PR to update docker in the dind service: https://github.com/taskcluster/dind-service/pull/5
Flags: needinfo?(wcosta)
(In reply to Wander Lairson Costa [:wcosta] from comment #4)
> I have a PR to update docker in the dind service:
> https://github.com/taskcluster/dind-service/pull/5

This is great! I see that PR is merged. Can you update this ticket with the status? When does that merge make it to production, so that I can test and we can try to close this ticket?
Flags: needinfo?(wcosta)
(In reply to Nick Alexander :nalexander [he/him] from comment #5)
> This is great! I see that PR is merged. Can you update this ticket with the
> status? When does that merge make it to production, so that I can test and
> we can try to close this ticket?

During integration I found an issue; I am preparing a new patch.
Flags: needinfo?(wcosta)
I made a bunch of changes to the underlying TC integration (which I just landed), so I retried this to see if anything changed. I now have a task group in a _very_ strange state: https://tools.taskcluster.net/groups/MQoMhW3sS32JEBYfYCMB4Q

It says the image task is still running, but half the system thinks it has finished (there's an artifact). There's no docker error string, but I have a 0 byte image again. Something is obviously still wrong -- maybe the same thing -- but now TC seems to be doing a worse job handling the situation.
I don't think anything's changed in TC -- dind never got merged. Here are the logs from that instance:

Oct 30 23:27:46 docker-worker.aws-provisioner.us-west-2b.ami-6f401817.c5-9xlarge.i-0214eb1b81d810c7b docker-worker: {"type":"claimed task","source":"top","provisionerId":"aws-provisioner-v1","workerId":"i-0214eb1b81d810c7b","workerGroup":"us-west-2","workerType":"application-services-r","workerNodeType":"c5.9xlarge","taskId":"TYdlu2feSOy-p66j1h09Kw","runId":0}
Oct 30 23:27:46 docker-worker.aws-provisioner.us-west-2b.ami-6f401817.c5-9xlarge.i-0214eb1b81d810c7b docker-worker: {"type":"not superseding","source":"top","provisionerId":"aws-provisioner-v1","workerId":"i-0214eb1b81d810c7b","workerGroup":"us-west-2","workerType":"application-services-r","workerNodeType":"c5.9xlarge","taskId":"TYdlu2feSOy-p66j1h09Kw","message":"no supersederUrl in payload"}
Oct 30 23:27:46 docker-worker.aws-provisioner.us-west-2b.ami-6f401817.c5-9xlarge.i-0214eb1b81d810c7b docker-worker: {"type":"run task","source":"top","provisionerId":"aws-provisioner-v1","workerId":"i-0214eb1b81d810c7b","workerGroup":"us-west-2","workerType":"application-services-r","workerNodeType":"c5.9xlarge","taskId":"TYdlu2feSOy-p66j1h09Kw","runId":0}
Oct 30 23:27:46 docker-worker.aws-provisioner.us-west-2b.ami-6f401817.c5-9xlarge.i-0214eb1b81d810c7b docker-worker: {"type":"task start","source":"top","provisionerId":"aws-provisioner-v1","workerId":"i-0214eb1b81d810c7b","workerGroup":"us-west-2","workerType":"application-services-r","workerNodeType":"c5.9xlarge","taskId":"TYdlu2feSOy-p66j1h09Kw","runId":0}
Oct 30 23:27:46 docker-worker.aws-provisioner.us-west-2b.ami-6f401817.c5-9xlarge.i-0214eb1b81d810c7b docker-worker: {"type":"next claim","source":"top","provisionerId":"aws-provisioner-v1","workerId":"i-0214eb1b81d810c7b","workerGroup":"us-west-2","workerType":"application-services-r","workerNodeType":"c5.9xlarge","time":923130.7692307692,"primaryTaskId":"TYdlu2feSOy-p66j1h09Kw","primaryRunId":0,"taskId":"TYdlu2feSOy-p66j1h09Kw","runId":0,"takenUntil":"2018-10-30T23:47:46.402Z"}
Oct 30 23:39:45 docker-worker.aws-provisioner.us-west-2b.ami-6f401817.c5-9xlarge.i-0214eb1b81d810c7b docker-worker: {"type":"Uploading public/image.tar.lz4","source":"uploadToS3","taskId":"TYdlu2feSOy-p66j1h09Kw","runId":0,"artifactName":"public/image.tar.lz4"}
Oct 30 23:43:10 docker-worker.aws-provisioner.us-west-2b.ami-6f401817.c5-9xlarge.i-0214eb1b81d810c7b docker-worker: {"type":"reclaiming task","source":"top","provisionerId":"aws-provisioner-v1","workerId":"i-0214eb1b81d810c7b","workerGroup":"us-west-2","workerType":"application-services-r","workerNodeType":"c5.9xlarge","primaryTaskId":"TYdlu2feSOy-p66j1h09Kw","primaryRunId":0,"taskId":"TYdlu2feSOy-p66j1h09Kw","runId":0,"takenUntil":"2018-10-30T23:47:46.402Z"}
Oct 30 23:43:10 docker-worker.aws-provisioner.us-west-2b.ami-6f401817.c5-9xlarge.i-0214eb1b81d810c7b docker-worker: {"type":"reclaimed task","source":"top","provisionerId":"aws-provisioner-v1","workerId":"i-0214eb1b81d810c7b","workerGroup":"us-west-2","workerType":"application-services-r","workerNodeType":"c5.9xlarge","primaryTaskId":"TYdlu2feSOy-p66j1h09Kw","primaryRunId":0,"taskId":"TYdlu2feSOy-p66j1h09Kw","runId":0,"takenUntil":"2018-10-31T00:03:10.413Z"}
Oct 30 23:43:10 docker-worker.aws-provisioner.us-west-2b.ami-6f401817.c5-9xlarge.i-0214eb1b81d810c7b docker-worker: {"type":"next claim","source":"top","provisionerId":"aws-provisioner-v1","workerId":"i-0214eb1b81d810c7b","workerGroup":"us-west-2","workerType":"application-services-r","workerNodeType":"c5.9xlarge","time":923360,"primaryTaskId":"TYdlu2feSOy-p66j1h09Kw","primaryRunId":0,"taskId":"TYdlu2feSOy-p66j1h09Kw","runId":0,"takenUntil":"2018-10-31T00:03:10.413Z"}
Oct 30 23:58:34 docker-worker.aws-provisioner.us-west-2b.ami-6f401817.c5-9xlarge.i-0214eb1b81d810c7b docker-worker: {"type":"reclaiming task","source":"top","provisionerId":"aws-provisioner-v1","workerId":"i-0214eb1b81d810c7b","workerGroup":"us-west-2","workerType":"application-services-r","workerNodeType":"c5.9xlarge","primaryTaskId":"TYdlu2feSOy-p66j1h09Kw","primaryRunId":0,"taskId":"TYdlu2feSOy-p66j1h09Kw","runId":0,"takenUntil":"2018-10-31T00:03:10.413Z"}

Wander, any idea what's going on here?
Depends on: 1507954
Curious: a job with a very minimal Dockerfile succeeded: https://tools.taskcluster.net/groups/edu6XhdBTg62NCtkIjV5Ug/tasks/fYj0wUw3SGqfOCrNW3O42A/details

That suggests something is happening with the very large images that a-s is using. Perhaps this is as simple as lz4 running out of memory?
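For what it's worth, if the image artifact is produced by piping `docker save` through `lz4` (an assumption; the exact pipeline isn't shown in this bug), then in bash a failure in the `docker save` stage is masked unless the shell is told to propagate pipeline failures, which could also explain a tiny or zero-byte artifact from a task that "succeeds". A hypothetical sketch, with illustrative command and file names:

```sh
# Without pipefail, the pipeline's exit status is lz4's alone, so a failed
# `docker save` can still produce a truncated image.tar.lz4 and report success.
docker save taskcluster-built | lz4 > image.tar.lz4

# With pipefail plus a size sanity check, either a docker save failure or an
# lz4 crash (e.g. killed by the OOM killer) surfaces as a task failure:
set -o pipefail
docker save taskcluster-built | lz4 > image.tar.lz4 || exit 1
[ -s image.tar.lz4 ] || { echo "image artifact is empty" >&2; exit 1; }
```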

Wander: what's our plan here? Is this important and/or common enough to bother fixing now, or should we WONTFIX in favour of a quicker move to generic-worker?

Flags: needinfo?(wcosta)

(In reply to Chris Cooper [:coop] pronoun: he from comment #10)

> Wander: what's our plan here? Is this important and/or common enough to
> bother fixing now, or should we WONTFIX in favour of a quicker move to
> generic-worker?

I concluded that this was strictly based on image size, so if we can just keep the "your images need to be small" message high in the SEO rankings, that's enough for this.

(In reply to Chris Cooper [:coop] pronoun: he from comment #10)

> Wander: what's our plan here? Is this important and/or common enough to
> bother fixing now, or should we WONTFIX in favour of a quicker move to
> generic-worker?

Upgrading dind wasn't possible because newer versions of dind work very differently. The solution is to set up a host machine with generic-worker, I guess.

Flags: needinfo?(wcosta)
Component: Docker-Worker → Workers
Status: ASSIGNED → RESOLVED
Closed: 6 years ago
Resolution: --- → WONTFIX
See Also: → 1587611