Bug 1398964 (Closed)
Opened 7 years ago, closed 7 years ago
Retry 'Publish to Balrog' jobs on 'Error calling 'link' for balrogVPNProxy : Gateway Timeout'
Categories: Release Engineering :: Release Automation: Other (enhancement, P1)
Tracking: (Not tracked)
Status: RESOLVED FIXED
People
(Reporter: nthomas, Unassigned)
References
Details
(Whiteboard: [releaseduty])
e.g. https://public-artifacts.taskcluster.net/VerKfca7RNKmKZ49PjLucg/0/public/logs/live_backing.log
[taskcluster 2017-09-11 22:13:55.746Z] Task ID: VerKfca7RNKmKZ49PjLucg
[taskcluster 2017-09-11 22:13:55.746Z] Worker ID: i-0cd797c59ce3227ab
[taskcluster 2017-09-11 22:13:55.746Z] Worker Group: us-east-1
[taskcluster 2017-09-11 22:13:55.746Z] Worker Node Type: r3.xlarge
[taskcluster 2017-09-11 22:13:55.746Z] Worker Type: funsize-balrog
[taskcluster 2017-09-11 22:13:55.746Z] Public IP: 52.87.236.39
[taskcluster:error] Task was aborted because states could not be created successfully. Error calling 'link' for balrogVPNProxy : Gateway Timeout
[taskcluster 2017-09-11 22:44:20.996Z] Unsuccessful task run with exit code: -1 completed in 1825.252 seconds
We often hit this once per release. Bug 1394657 shows how to set retry for a -1 exit code.
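The payload change bug 1394657 describes uses docker-worker's `onExitStatus` feature. A sketch of what that looks like in a task definition (the `image` and `command` values are illustrative, not the real funsize-balrog task; `onExitStatus.retry` and `features.balrogVPNProxy` are real docker-worker payload fields). As comment 3 notes, this turned out not to help, because -1 is the worker's "never started" marker rather than a real exit status:

```javascript
// Sketch of a docker-worker task payload using onExitStatus.retry.
// NOTE: the image and command here are made up for illustration.
const payload = {
  image: 'taskcluster/funsize-balrog-submitter', // illustrative
  command: ['/runme.sh'],                        // illustrative
  features: {
    balrogVPNProxy: true, // the linked proxy container that fails to start
  },
  onExitStatus: {
    retry: [-1], // ask the worker to re-run the task on this exit status
  },
};
```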
Reporter
Comment 1 • 7 years ago
Assignee: nobody → nthomas
Priority: -- → P1
Reporter
Comment 2 • 7 years ago
Merged and deployed to prod.
Status: NEW → RESOLVED
Closed: 7 years ago
Resolution: --- → FIXED
Reporter
Comment 3 • 7 years ago
This didn't work on https://tools.taskcluster.net/groups/R0ES6YjPQt-rzimWCAHeRQ/tasks/UA0cxuESTFSwLv35wCe64g/runs/0, despite picking up the payload change.
dustin says -1 is not a real exit code and means "this never started up", and since onExitStatus is a worker feature (specifically docker-worker) I think this means we're out of luck. Any other ideas, rail?
Status: RESOLVED → REOPENED
Resolution: FIXED → ---
Comment 4 • 7 years ago
Isn't funsize-balrog a docker-worker? https://tools.taskcluster.net/aws-provisioner/funsize-balrog/
Comment 5 • 7 years ago
Hmmmm. We should probably look at the balrogVPNProxy feature implementation and see whether there is something we can use to make it retry.
Reporter
Comment 6 • 7 years ago
I'm not sure where that lives, but bug 1169454 has been around for quite a while.
Reporter
Comment 7 • 7 years ago
dustin, is there anything we can do to debug the cause of this issue? Here's a recent example.
Assignee: nthomas → nobody
Flags: needinfo?(dustin)
Reporter
Comment 8 • 7 years ago

Comment 9 • 7 years ago
If the Balrog proxy is anything like the other proxies, it makes a connection on startup, and if that fails you get an error like this. Basically Docker goes "whoa, the container crashed before I could link it!"
"Gateway Timeout" sounds like an error from a load balancer of some sort (ZLB?)
Flags: needinfo?(dustin)
Reporter
Comment 10 • 7 years ago
So I think it's https://github.com/taskcluster/docker-worker/blob/master/src/lib/features/balrog_vpn_proxy.js that provides the proxy.
We don't see any of the debugging messages there because DEBUG in https://github.com/taskcluster/docker-worker/blob/27241854517b09881df3e7a126b9235cd39a9d84/deploy/template/etc/default/docker-worker gets an empty value at https://github.com/taskcluster/docker-worker/blob/27241854517b09881df3e7a126b9235cd39a9d84/deploy/bin/import-docker-worker-secrets#L66. It's entirely possible I don't understand how this all fits together, but what is the socially responsible way of debugging this?
Can we set DEBUG='taskcluster-docker-worker:features:balrogVPNProxy' in production?
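For context: the `debug` npm module that docker-worker uses only emits output when the DEBUG environment variable matches a logger's namespace, so an empty DEBUG silences the balrogVPNProxy messages entirely. A minimal sketch of that namespace matching (the function name is hypothetical; the wildcard handling mirrors the `debug` module's behavior):

```javascript
// Sketch of how a `debug`-style logger decides whether to emit output.
// An empty DEBUG means no namespaces match, hence no proxy debug lines.
function debugEnabled(namespace, debugEnv) {
  if (!debugEnv) return false; // empty DEBUG => logging disabled everywhere
  return debugEnv.split(/[\s,]+/).some(pattern => {
    if (!pattern) return false;
    // '*' in a DEBUG pattern matches any run of characters
    const re = new RegExp('^' + pattern.replace(/\*/g, '.*?') + '$');
    return re.test(namespace);
  });
}

debugEnabled('taskcluster-docker-worker:features:balrogVPNProxy', '');
// false, which is why no proxy debug lines appear in the worker logs
```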
Comment 11 • 7 years ago
Those logs would go to papertrail anyway. I'm not sure the debug(..) output would be very useful, and enabling it requires redeploying the worker, so it's not especially easy. Here's the logging we did get:
Sep 28 17:53:30 Thu, 28 Sep 2017 22:53:30 GMT taskcluster-docker-worker:features:balrogVPNProxy Error: Gateway Timeout
Sep 28 17:53:30 at Request.callback (/home/ubuntu/docker_worker/node_modules/superagent/lib/node/index.js:698:17)
Sep 28 17:53:30 at IncomingMessage.<anonymous> (/home/ubuntu/docker_worker/node_modules/superagent/lib/node/index.js:922:12)
Sep 28 17:53:30 at emitNone (events.js:91:20)
Sep 28 17:53:30 at IncomingMessage.emit (events.js:185:7)
Sep 28 17:53:30 at endReadableNT (_stream_readable.js:974:12)
Sep 28 17:53:30 at _combinedTickCallback (internal/process/next_tick.js:74:11)
Sep 28 17:53:30 at process._tickDomainCallback (internal/process/next_tick.js:122:9)
Sep 28 17:53:30 {"type":"task aborted","source":"top","provisionerId":"aws-provisioner-v1","workerId":"i-075c20f7dc473f632","workerGroup":"us-west-2","workerType":"funsize-balrog","workerNodeType":"m3.xlarge","taskId":"S3S69WtkQ1ekWx5-TC78Eg","runId":0,"exception":"","error":"\n[taskcluster:error] Task was aborted because states could not be created successfully. Error calling 'link' for balrogVPNProxy : Gateway Timeout\n"}
Sep 28 17:53:30 Thu, 28 Sep 2017 22:53:30 GMT docker-worker:states taskId: S3S69WtkQ1ekWx5-TC78Eg at state: killed
Sadly, async stacks (like with Twisted) aren't so useful, so it's hard to tell who called Request there, but balrog_vpn_proxy.js is a good guess, since it tries to connect to Balrog.
Here's a hypothesis: nginx is up, but the OpenVPN connection is not yet up when this request comes in, so nginx replies with a gateway error (the 504 Gateway Timeout we see above). Maybe just catching that error here and retrying would be the right choice?
I'm just confused why this would give a failure calling "link" -- typically we see that message when the container itself crashes, which does not appear to be the case here.
Wander, what do you think?
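The retry dustin hypothesizes could look something like this inside the feature's startup path. This is only a sketch of the idea, not docker-worker's actual code; the function and parameter names are illustrative:

```javascript
// Hypothetical retry wrapper for the proxy health-check request made while
// the OpenVPN link is still coming up. Retries only transient gateway errors
// (502/503/504) with exponential backoff, and rethrows anything else.
async function requestWithRetry(fn, {retries = 5, delayMs = 1000} = {}) {
  for (let attempt = 0; ; attempt++) {
    try {
      return await fn();
    } catch (err) {
      const transient = [502, 503, 504].includes(err.status);
      if (!transient || attempt >= retries) throw err;
      // back off: delayMs, 2*delayMs, 4*delayMs, ...
      await new Promise(resolve => setTimeout(resolve, delayMs * 2 ** attempt));
    }
  }
}
```

The design question is where to bound the retries: too few and we still abort the task during a slow VPN handshake, too many and a genuinely dead proxy wastes half an hour before the task fails, as in the 1825-second run in the description.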
Flags: needinfo?(wcosta)
Comment 12 • 7 years ago
This should be fixed when [1] gets deployed next week.
[1] https://github.com/taskcluster/docker-worker/commit/5b44db12857b0396c088a5c5064d88a2415a0472
Flags: needinfo?(wcosta)
Comment 13 • 7 years ago
nthomas: I haven't noticed this lately. Have we hit it at all recently? Think we can close, given comment 12?
Flags: needinfo?(nthomas)
Reporter
Comment 14 • 7 years ago
I haven't seen it either, yay for fixes.
Status: REOPENED → RESOLVED
Closed: 7 years ago → 7 years ago
Flags: needinfo?(nthomas)
Resolution: --- → FIXED
Updated • 7 years ago
Whiteboard: [releaseduty]