Closed Bug 1398964 Opened 7 years ago Closed 7 years ago

Retry 'Publish to Balrog' jobs on 'Error calling 'link' for balrogVPNProxy : Gateway Timeout'

Categories

(Release Engineering :: Release Automation: Other, enhancement, P1)


Tracking

(Not tracked)

RESOLVED FIXED

People

(Reporter: nthomas, Unassigned)

References

Details

(Whiteboard: [releaseduty])

e.g. https://public-artifacts.taskcluster.net/VerKfca7RNKmKZ49PjLucg/0/public/logs/live_backing.log

[taskcluster 2017-09-11 22:13:55.746Z] Task ID: VerKfca7RNKmKZ49PjLucg
[taskcluster 2017-09-11 22:13:55.746Z] Worker ID: i-0cd797c59ce3227ab
[taskcluster 2017-09-11 22:13:55.746Z] Worker Group: us-east-1
[taskcluster 2017-09-11 22:13:55.746Z] Worker Node Type: r3.xlarge
[taskcluster 2017-09-11 22:13:55.746Z] Worker Type: funsize-balrog
[taskcluster 2017-09-11 22:13:55.746Z] Public IP: 52.87.236.39
[taskcluster:error] Task was aborted because states could not be created successfully. Error calling 'link' for balrogVPNProxy : Gateway Timeout
[taskcluster 2017-09-11 22:44:20.996Z] Unsuccessful task run with exit code: -1 completed in 1825.252 seconds

We often hit this once per release. Bug 1394657 shows how to set a retry for a -1 exit code.
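For reference, the payload change from bug 1394657 amounts to adding an onExitStatus block to the docker-worker task payload. A minimal sketch, assuming a hand-written funsize-balrog payload (the field values here are illustrative, not the real funsize template):

    // Sketch of the docker-worker task payload fragment; values are illustrative.
    const payload = {
      features: {
        balrogVPNProxy: true,   // request the Balrog VPN proxy link
      },
      onExitStatus: {
        retry: [-1],            // ask docker-worker to retry runs that end with exit code -1
      },
    };

Note that onExitStatus is handled by docker-worker itself, so it only applies to exit codes from a container that actually started.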
Assignee: nobody → nthomas
Priority: -- → P1
Merged and deployed to prod.
Status: NEW → RESOLVED
Closed: 7 years ago
Resolution: --- → FIXED
This didn't work on https://tools.taskcluster.net/groups/R0ES6YjPQt-rzimWCAHeRQ/tasks/UA0cxuESTFSwLv35wCe64g/runs/0, despite picking up the payload change. dustin says -1 is not a real exit code and means "this never started up", and since onExitStatus is a worker thing (specifically docker-worker) I think this means we're out of luck. Any other ideas, rail?
Status: RESOLVED → REOPENED
Resolution: FIXED → ---
Hmmmm. Probably we should look at the balrogVpn feature implementation and see if there is something that we can use to make it retry.
I'm not sure where that lives, but bug 1169454 has been around for quite a while.
dustin, is there anything we can do to debug the cause of this issue? Here's a recent example.
Assignee: nthomas → nobody
Flags: needinfo?(dustin)
If the Balrog proxy is anything like the other proxies, then it makes a connection on startup, and if that fails, you get an error like this. Basically docker goes "woah, the container crashed before I could link it!" The "Gateway Timeout" itself sounds like an error from a load balancer of some sort (ZLB?).
Flags: needinfo?(dustin)
So I think it's https://github.com/taskcluster/docker-worker/blob/master/src/lib/features/balrog_vpn_proxy.js that provides the proxy. We don't see any of the debugging messages there because DEBUG in https://github.com/taskcluster/docker-worker/blob/27241854517b09881df3e7a126b9235cd39a9d84/deploy/template/etc/default/docker-worker gets an empty value at https://github.com/taskcluster/docker-worker/blob/27241854517b09881df3e7a126b9235cd39a9d84/deploy/bin/import-docker-worker-secrets#L66. It's entirely possible that I don't understand how this all fits together, but what is the socially responsible way of debugging this? Can we set DEBUG='taskcluster-docker-worker:features:balrogVPNProxy' in production?
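For what it's worth, as I understand it the DEBUG variable there drives the namespace filter of the debug npm module, so the logging in a feature module like balrog_vpn_proxy.js is gated roughly like this (assumed usage, not copied from the actual file):

    // debug() only emits output when the DEBUG environment variable matches
    // the namespace, e.g. DEBUG='taskcluster-docker-worker:features:balrogVPNProxy'
    const debug = require('debug')('taskcluster-docker-worker:features:balrogVPNProxy');
    debug('waiting for the balrog proxy container to come up');  // silent unless DEBUG matches

So an empty DEBUG in /etc/default/docker-worker means all of those messages are dropped.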
Those logs would go to papertrail anyway. I'm not sure the debug(..) output would be very useful, and enabling it requires a redeployment of the worker, so it's not especially easy. Here's the logging we did get:

Sep 28 17:53:30 Thu, 28 Sep 2017 22:53:30 GMT taskcluster-docker-worker:features:balrogVPNProxy Error: Gateway Timeout
Sep 28 17:53:30     at Request.callback (/home/ubuntu/docker_worker/node_modules/superagent/lib/node/index.js:698:17)
Sep 28 17:53:30     at IncomingMessage.<anonymous> (/home/ubuntu/docker_worker/node_modules/superagent/lib/node/index.js:922:12)
Sep 28 17:53:30     at emitNone (events.js:91:20)
Sep 28 17:53:30     at IncomingMessage.emit (events.js:185:7)
Sep 28 17:53:30     at endReadableNT (_stream_readable.js:974:12)
Sep 28 17:53:30     at _combinedTickCallback (internal/process/next_tick.js:74:11)
Sep 28 17:53:30     at process._tickDomainCallback (internal/process/next_tick.js:122:9)
Sep 28 17:53:30 {"type":"task aborted","source":"top","provisionerId":"aws-provisioner-v1","workerId":"i-075c20f7dc473f632","workerGroup":"us-west-2","workerType":"funsize-balrog","workerNodeType":"m3.xlarge","taskId":"S3S69WtkQ1ekWx5-TC78Eg","runId":0,"exception":"","error":"\n[taskcluster:error] Task was aborted because states could not be created successfully. Error calling 'link' for balrogVPNProxy : Gateway Timeout\n"}
Sep 28 17:53:30 Thu, 28 Sep 2017 22:53:30 GMT docker-worker:states taskId: S3S69WtkQ1ekWx5-TC78Eg at state: killed

Sadly, async stacks (like with Twisted) aren't very useful, so it's hard to tell who called Request there, but balrog_vpn_proxy.js is a good guess, since it tries to connect to Balrog. Here's a hypothesis: nginx is up, but the OpenVPN connection is not yet up when this request comes in, so nginx replies with a gateway error (504 Gateway Timeout here). Maybe just catching that error and retrying would be the right choice? I'm just confused why this shows up as a failure calling "link" -- typically we see that message when the container itself crashes, which does not appear to be the case here. Wander, what do you think?
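If that hypothesis is right, something along these lines in the feature's startup path would paper over the race. This is only a sketch, assuming superagent and made-up attempt/delay numbers; it is not the actual balrog_vpn_proxy.js code:

    // Treat gateway-style responses from the proxy as "not up yet" and retry
    // with a short backoff before giving up. Illustration only.
    const request = require('superagent');

    async function waitForProxy(url, attempts = 10, delayMs = 5000) {
      for (let i = 0; i < attempts; i++) {
        try {
          await request.get(url);
          return; // proxy answered, safe to link the container
        } catch (err) {
          const status = err.status || (err.response && err.response.status);
          if (![502, 503, 504].includes(status) || i === attempts - 1) {
            throw err; // a real failure, or we're out of retries
          }
          await new Promise(resolve => setTimeout(resolve, delayMs));
        }
      }
    }

Retrying only on gateway-style statuses keeps genuine failures fast-failing while giving the OpenVPN tunnel some time to come up.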
Flags: needinfo?(wcosta)
Flags: needinfo?(wcosta)
nthomas: I haven't noticed this lately. Have we hit this in a while? Think we can close given comment 12?
Flags: needinfo?(nthomas)
I haven't seen it either, yay for fixes.
Status: REOPENED → RESOLVED
Closed: 7 years ago
Flags: needinfo?(nthomas)
Resolution: --- → FIXED
Whiteboard: [releaseduty]