Closed Bug 1240477 Opened 4 years ago Closed 4 years ago

[taskcluster:error] Failure to properly start execution environment. HTTP code is 500 which indicates error: server error - Cannot start container <hash>: [8] System error: read parent: connection reset by peer

Categories

(Firefox Build System :: Task Configuration, task, major)

task
Not set
major

Tracking

(Not tracked)

RESOLVED FIXED

People

(Reporter: cbook, Assigned: garndt)

References

()

Details

(Keywords: intermittent-failure)

Attachments

(1 file)

I have seen this before, but this time it seems to be more frequent

HTTP code is 500 which indicates error: server error - Cannot start container

like:
https://treeherder.mozilla.org/logviewer.html#?job_id=3783463&repo=b2g-inbound

Retriggering seems to work some of the time, but the failure is frequent
Maybe this is a problem with the docker daemon on the worker? My reading is that a docker command is issued, which opens an HTTP connection to the docker daemon running on localhost, and this connection drops, e.g. perhaps because of an instability in the daemon itself.

I checked one of the failing tasks, which was running on worker type `emulator-jb-debug`, and found that the EC2 AMI for this worker type has not been updated since last Wednesday (13 Jan), five days ago, so it doesn't seem to be related to an updated AMI, given that this problem has only been getting worse today. So I'm out of ideas for now.

Let's see what our resident expert Greg says when he gets in later...
Flags: needinfo?(garndt)
I'm not immediately sure yet what started causing this, but looking at the logs it's only affecting the emulator-jb and emulator-kk builders, and none of the other builders using that same AMI.  It also seems to affect only the first task after the image was downloaded; the next task that an affected worker claims completes just fine.

I'm not immediately sure about this yet but definitely will keep my eyes open.
Flags: needinfo?(garndt)
This spiked suddenly today for some reason.
Severity: normal → major
Summary: Intermittent HTTP code is 500 which indicates error: server error - Cannot start container → HTTP code is 500 which indicates error: server error - Cannot start container <hash>: [8] System error: read parent: connection reset by peer
More info from papertrail:

 Jan 21 01:12:03 docker-worker.aws-provisioner.us-west-1b.ami-e5c6b385.m3-2xlarge.i-4aece5f8 docker-worker: {"type":"error starting container","source":"top","provisionerId":"aws-provisioner-v1","workerId":"i-4aece5f8","workerGroup":"us-west-1b","workerType":"emulator-jb-debug","workerNodeType":"m3.2xlarge","taskId":"TOx36Xb4QOaXRgn5d69IpA","runId":0,"error":"Error: HTTP code is 500 which indicates error: server error - Cannot start container ca8411f785cf22f386e2401bd992949a85c5d52decdfd71f3260ae80fef28659: [8] System error: read parent: connection reset by peer\n\n    at /home/ubuntu/docker_worker/node_modules/dockerode/node_modules/docker-modem/lib/modem.js:229:17\n    at getCause (/home/ubuntu/docker_worker/node_modules/dockerode/node_modules/docker-modem/lib/modem.js:257:7)\n    at [object Object].Modem.buildPayload (/home/ubuntu/docker_worker/node_modules/dockerode/node_modules/docker-modem/lib/modem.js:228:5)\n    at IncomingMessage.<anonymous> (/home/ubuntu/docker_worker/node_modules/dockerode/node_modules/docker-modem/lib/modem.js:204:14)\n    at IncomingMessage.emit (events.js:129:20)\n    at _stream_read
Jan 21 01:12:06 docker-worker.aws-provisioner.us-west-1b.ami-e5c6b385.m3-2xlarge.i-4aece5f8 docker-worker: Thu, 21 Jan 2016 00:12:06 GMT taskcluster-docker-worker:garbageCollector marked ca8411f785cf22f386e2401bd992949a85c5d52decdfd71f3260ae80fef28659
Jan 21 01:12:41 docker-worker.aws-provisioner.us-west-1b.ami-e5c6b385.m3-2xlarge.i-4aece5f8 docker-worker: {"type":"container removed","source":"top","provisionerId":"aws-provisioner-v1","workerId":"i-4aece5f8","workerGroup":"us-west-1b","workerType":"emulator-jb-debug","workerNodeType":"m3.2xlarge","container":"ca8411f785cf22f386e2401bd992949a85c5d52decdfd71f3260ae80fef28659","caches":["level-3-b2g-inbound-emulator-jb-debug::139ae748-b695-4edc-959b-07f62e230961","level-3-b2g-inbound-emulator-jb-debug-objdir-gecko::35d44c37-23ef-40bf-bc2d-a3a301e27015","level-3-b2g-inbound-tc-vcs::c560aee5-b15d-422d-ab6d-eff3d0c45600","tooltool-cache::c693bca0-fd98-4f25-bcfd-7e3f8f6d807c"]} 

My guess is that for these jobs we are starting a linked container for some feature enabled in the task. For whatever reason (probably a docker bug itself, most likely https://github.com/docker/docker/issues/14203) the request to start the docker container fails because the docker daemon dies. The docker client connects over HTTP to the docker daemon running on localhost, the daemon goes about its business, dies, and the client then reports back that it lost connectivity to the daemon.

The fix in comment 7 might be a temporary workaround until issue 14203 is resolved (assuming that is the root cause).

A whole bunch of assumptions here, but this would be my starting point.
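For reference, the error the worker reports is just the HTTP 500 coming back from the local docker daemon's remote API. A minimal sketch of poking the same endpoint by hand on an affected host, assuming a curl new enough to support --unix-socket and reusing the container hash from the papertrail log above purely as an example:

$ # same start request dockerode issues, sent straight to the daemon socket
$ curl -v --unix-socket /var/run/docker.sock \
    -X POST http://localhost/containers/ca8411f785cf22f386e2401bd992949a85c5d52decdfd71f3260ae80fef28659/start
$ # on an affected host this should come back as HTTP 500 with
$ # "Cannot start container <hash>: [8] System error: read parent: connection reset by peer"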
So these jobs aren't using a feature with a linked container, but rather the following image:
https://queue.taskcluster.net/v1/task/bRNlznEqR5eXxo6zwjLOEw/runs/0/artifacts/public/image.tar

I'll download this image and see if it can consistently be started up in docker ok. If not, I'll try adding an env var with some spaces, and see if that fixes it!!
No problem running this locally....

$ curl -L https://queue.taskcluster.net/v1/task/bRNlznEqR5eXxo6zwjLOEw/runs/0/artifacts/public/image.tar > image.tar
$ docker images > old
$ docker load -i image.tar
$ docker images > new
$ diff old new
$ docker run -t mozilla-central:e02530aea30953299a7f2f40f2e5b308ea86cea19eb779b8fa82580b338e2028 /bin/bash -c "echo 'hello'"
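To check that it starts consistently rather than just once, something like the following loop should do (a sketch only; the iteration count is arbitrary and --rm is just there so stopped containers don't pile up):

$ for i in $(seq 1 20); do docker run --rm mozilla-central:e02530aea30953299a7f2f40f2e5b308ea86cea19eb779b8fa82580b338e2028 /bin/bash -c "echo ok $i" || break; done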
Locally I have:
Docker version 1.6.2, build 7c8fca2
So far this is where I stand with looking into it...

docker-worker changes:
https://github.com/taskcluster/docker-worker/compare/1bd0984d5dba9427b2ae7ed9460de77c11f4e38a...ed3b79ab9ee68020b47ba8ae322453af1db60073

Nothing there seems to obviously break things.  The same base AMI is used for these instances.  The changes were not related to docker.  (By the way, this is a problem with the docker daemon starting the container, not with the worker itself.)

Looking at the docker logs on a host, I see similar to what has already been reported as the error here:

INFO[5659] +job log(create, <container id>, b14a218fd7dbf5d4262659942e7e9805:latest)
INFO[5659] -job log(create, <container id>, b14a218fd7dbf5d4262659942e7e9805:latest) = OK (0)
INFO[5659] -job create() = OK (0)
INFO[5659] POST /containers/<container id>/attach?stream=true&stdout=true&stderr=true
INFO[5659] +job container_inspect(<container id>)
INFO[5659] -job container_inspect(<container id>) = OK (0)
INFO[5659] +job attach(<container id>)
INFO[5659] POST /containers/<container id>/start
INFO[5659] +job start(<container id>)
INFO[5659] +job allocate_interface(<container id>)
INFO[5659] -job allocate_interface(<container id>) = OK (0)
INFO[5659] +job log(start, <container id>, b14a218fd7dbf5d4262659942e7e9805:latest)
INFO[5659] -job log(start, <container id>, b14a218fd7dbf5d4262659942e7e9805:latest) = OK (0)
WARN[5659] signal: killed
INFO[5659] -job attach(<container id>) = OK (0)
INFO[5659] +job release_interface(<container id>)
INFO[5659] -job release_interface(<container id>) = OK (0)
INFO[5659] +job release_interface(<container id>)
INFO[5659] -job release_interface(<container id>) = OK (0)
INFO[5659] +job log(die, <container id>, b14a218fd7dbf5d4262659942e7e9805:latest)
INFO[5659] -job log(die, <container id>, b14a218fd7dbf5d4262659942e7e9805:latest) = OK (0)
Cannot start container <container id>: [8] System error: read parent: connection reset by peer
INFO[5659] -job start(<container id>) = ERR (1)

The particular system I'm looking at already had the image and completed 2 tasks prior to this failure using that same image:
https://papertrailapp.com/systems/169543023/events?q=%22claim+task%22+OR+%22error+starting%22+OR+%22task+resolved%22

I have found other systems where they completed a couple of tasks, failed one because of this error, and then started another task successfully.  Perhaps something is killing these containers.


We could attempt to update docker to a more recent version, run some stress tests against that and see where we stand.  So far I can't see any obvious change that would have caused a spike in these failures.
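As a rough idea of what such a stress test could look like against an upgraded daemon (a sketch only: the image tag is reused from the local test above and the counts are arbitrary), start a batch of containers concurrently and report any failed starts:

$ for i in $(seq 1 10); do ( docker run --rm mozilla-central:e02530aea30953299a7f2f40f2e5b308ea86cea19eb779b8fa82580b338e2028 /bin/true || echo "start $i failed" ) & done; wait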
One thing that just occurred to me as I was looking at these is that it seems to be only the kk-debug and jb-debug worker types...  These same images are used on the other emulator workers, with the same EC2 instance types, and the same EC2 AMI... odd
Also, it seems to be only happening on b2g-inbound, at least from the tasks I see here: 
https://brasstacks.mozilla.com/orangefactor/?display=Bug&bugid=1240477&startday=2016-01-10&endday=2016-01-21&tree=trunk
It seems that this issue started around the same time as this push, but these commits should also be on other trees, which are not having the problem, and there is nothing in these pushes that I could see causing something like this:

https://treeherder.mozilla.org/#/jobs?repo=b2g-inbound&revision=ea63df265d29
Comment on attachment 8711085 [details]
MozReview Request: Bug 1240477 - Add project environment variable to emulator kk/jb tasks r=wcosta

https://reviewboard.mozilla.org/r/31969/#review28711
Attachment #8711085 - Flags: review?(wcosta) → review+
https://hg.mozilla.org/integration/b2g-inbound/rev/3957a325013c029e68c53c09205ccb602ea923ba
Bug 1240477 - Add environment variable to change docker run length r=wcosta
So I think this just might have been fixed (comment about the docker bug is in the change).

Looking at these results: if the problem were still around, these would have failed almost immediately after starting, but all of them are running and cloning/building.
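For clarity, the workaround doesn't change what the task does; it only changes the length of the environment/command that docker passes when starting the container, which per the docker issue above is enough to sidestep the bug. Roughly like this (the variable name below is made up for illustration; the real patch adds a project environment variable to the emulator kk/jb task definitions):

$ docker run -e TASK_ENV_PADDING='b2g-inbound' \
    mozilla-central:e02530aea30953299a7f2f40f2e5b308ea86cea19eb779b8fa82580b338e2028 /bin/bash -c "echo 'hello'"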
Assignee: nobody → garndt
Component: General → Task Configuration
I remember laughing at the Github issue when I first saw it. And it comes back to bite me \o/

Thanks Greg :)
https://hg.mozilla.org/mozilla-central/rev/3957a325013c
Status: NEW → RESOLVED
Closed: 4 years ago
Resolution: --- → FIXED
Some fun: I merged the *fix* for this to fx-team, which was not affected by the problem, and now that it has the fix for the problem, rather than being fixed it has the problem.

Side note: if you can, make your error messages which have variable data in them just a teeny bit longer - treeherder searches for intermittent-failure bugs with a summary including "[taskcluster:error] HTTP code is 500 which indicates error: server error - Cannot start container 31" so we're two characters away from being able to have a starrable summary for that failure line.
Status: RESOLVED → REOPENED
Resolution: FIXED → ---
Summary: HTTP code is 500 which indicates error: server error - Cannot start container <hash>: [8] System error: read parent: connection reset by peer → [taskcluster:error] Failure to properly start execution environment. HTTP code is 500 which indicates error: server error - Cannot start container <hash>: [8] System error: read parent: connection reset by peer
(In reply to Phil Ringnalda (:philor) from comment #25)
> Some fun: I merged the *fix* for this to fx-team, which was not affected by
> the problem, and now that it has the fix for the problem, rather than being
> fixed it has the problem.
> 
> Side note: if you can, make your error messages which have variable data in
> them just a teeny bit longer - treeherder searches for intermittent-failure
> bugs with a summary including "[taskcluster:error] HTTP code is 500 which
> indicates error: server error - Cannot start container 31" so we're two
> characters away from being able to have a starrable summary for that failure
> line.

Indeed, the problem has moved from b2g-i to fx-team now :(
I will be looking at this today.  I should be able to submit some task graphs testing changes without needing to push to the branches.  I'll test it against try, m-c, m-i, fx-team, and b2g-i
Here is a try push that's green:
https://treeherder.mozilla.org/#/jobs?repo=try&revision=191d72d168eb

fx-team: https://tools.taskcluster.net/task-graph-inspector/#axppRBgCRTWqjWxNXydR_Q/
b2g-inbound: https://tools.taskcluster.net/task-graph-inspector/#To_FDgdNSOOHu7TqxQLbPA/
mozilla-inbound: https://tools.taskcluster.net/task-graph-inspector/#eno5AthaSIez9kEm68ylXg
mozilla-central: https://tools.taskcluster.net/task-graph-inspector/#LXZzUrk3S4ugZllXdzE7Ow/


I don't think this should break anything, but I'm seeing that it requires a specific payload that sometimes isn't the same when pushing to try or when running the taskcluster-graph target directly.
Comment on attachment 8711085 [details]
MozReview Request: Bug 1240477 - Add project environment variable to emulator kk/jb tasks r=wcosta

Review request updated; see interdiff: https://reviewboard.mozilla.org/r/31969/diff/1-2/
Attachment #8711085 - Attachment description: MozReview Request: Bug 1240477 - Add environment variable to change docker run length r=wcosta → MozReview Request: Bug 1240477 - Add project environment variable to emulator kk/jb tasks r=wcosta
It appears that this has been fixed by multiple pieces falling into place.  Last reported event was more than a month ago.  Resolving for now.
Status: REOPENED → RESOLVED
Closed: 4 years ago → 4 years ago
Resolution: --- → FIXED
Product: TaskCluster → Firefox Build System