Closed Bug 1398964 Opened 7 years ago Closed 7 years ago

Retry 'Publish to Balrog' jobs on 'Error calling 'link' for balrogVPNProxy : Gateway Timeout'

Categories

(Release Engineering :: Release Automation: Other, enhancement, P1)


Tracking

(Not tracked)

RESOLVED FIXED

People

(Reporter: nthomas, Unassigned)

References

Details

(Whiteboard: [releaseduty])

e.g. https://public-artifacts.taskcluster.net/VerKfca7RNKmKZ49PjLucg/0/public/logs/live_backing.log

[taskcluster 2017-09-11 22:13:55.746Z] Task ID: VerKfca7RNKmKZ49PjLucg
[taskcluster 2017-09-11 22:13:55.746Z] Worker ID: i-0cd797c59ce3227ab
[taskcluster 2017-09-11 22:13:55.746Z] Worker Group: us-east-1
[taskcluster 2017-09-11 22:13:55.746Z] Worker Node Type: r3.xlarge
[taskcluster 2017-09-11 22:13:55.746Z] Worker Type: funsize-balrog
[taskcluster 2017-09-11 22:13:55.746Z] Public IP: 52.87.236.39
[taskcluster:error] Task was aborted because states could not be created successfully. Error calling 'link' for balrogVPNProxy : Gateway Timeout
[taskcluster 2017-09-11 22:44:20.996Z] Unsuccessful task run with exit code: -1 completed in 1825.252 seconds

We often hit this once per release. Bug 1394657 shows how to set a retry for a -1 exit code.
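For reference, the payload change from bug 1394657 amounts to adding an onExitStatus block to the docker-worker task payload. A minimal sketch, assuming a hand-written funsize-balrog payload (the field values here are illustrative, not the real funsize template):

    // Sketch of the docker-worker task payload fragment; values are illustrative.
    const payload = {
      features: {
        balrogVPNProxy: true,   // request the Balrog VPN proxy link
      },
      onExitStatus: {
        retry: [-1],            // ask docker-worker to retry runs that end with exit code -1
      },
    };

Note that onExitStatus is handled by docker-worker itself, so it only applies to exit codes from a container that actually started.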
Assignee: nobody → nthomas
Priority: -- → P1
Merged and deployed to prod.
Status: NEW → RESOLVED
Closed: 7 years ago
Resolution: --- → FIXED
This didn't work on https://tools.taskcluster.net/groups/R0ES6YjPQt-rzimWCAHeRQ/tasks/UA0cxuESTFSwLv35wCe64g/runs/0, despite picking up the payload change. dustin says -1 is not a real exit code and means "this never started up", and since onExitStatus is a worker thing (specifically docker-worker) I think this means we're out of luck. Any other ideas, rail?
Status: RESOLVED → REOPENED
Resolution: FIXED → ---
Hmmmm. Probably we should look at the balrogVpn feature implementation and see if there is something that we can use to make it retry.
I'm not sure where that lives, but bug 1169454 has been around for quite a while.
dustin, is there anything we can do to debug the cause of this issue? Here's a recent example.
Assignee: nthomas → nobody
Flags: needinfo?(dustin)
If the Balrog proxy is anything like the other proxies, then it makes a connection on startup, and if that fails, you get an error like this. Basically docker goes "woah, the container crashed before I could link it!" The "Gateway Timeout" itself sounds like an error from a load balancer of some sort (ZLB?).
Flags: needinfo?(dustin)
So I think it's https://github.com/taskcluster/docker-worker/blob/master/src/lib/features/balrog_vpn_proxy.js that provides the proxy. We don't see any of the debugging messages there because DEBUG in https://github.com/taskcluster/docker-worker/blob/27241854517b09881df3e7a126b9235cd39a9d84/deploy/template/etc/default/docker-worker gets an empty value at https://github.com/taskcluster/docker-worker/blob/27241854517b09881df3e7a126b9235cd39a9d84/deploy/bin/import-docker-worker-secrets#L66. It's entirely possible that I don't understand how this all fits together, but what is the socially responsible way of debugging this? Can we set DEBUG='taskcluster-docker-worker:features:balrogVPNProxy' in production?
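For what it's worth, as I understand it the DEBUG variable there drives the namespace filter of the debug npm module, so the logging in a feature module like balrog_vpn_proxy.js is gated roughly like this (assumed usage, not copied from the actual file):

    // debug() only emits output when the DEBUG environment variable matches
    // the namespace, e.g. DEBUG='taskcluster-docker-worker:features:balrogVPNProxy'
    const debug = require('debug')('taskcluster-docker-worker:features:balrogVPNProxy');
    debug('waiting for the balrog proxy container to come up');  // silent unless DEBUG matches

So an empty DEBUG in /etc/default/docker-worker means all of those messages are dropped.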
Those logs would go to papertrail anyway. I'm not sure the debug(..) output would be very useful, and enabling it requires a redeployment of the worker, so it's not especially easy. Here's the logging we did get:

Sep 28 17:53:30 Thu, 28 Sep 2017 22:53:30 GMT taskcluster-docker-worker:features:balrogVPNProxy Error: Gateway Timeout
Sep 28 17:53:30     at Request.callback (/home/ubuntu/docker_worker/node_modules/superagent/lib/node/index.js:698:17)
Sep 28 17:53:30     at IncomingMessage.<anonymous> (/home/ubuntu/docker_worker/node_modules/superagent/lib/node/index.js:922:12)
Sep 28 17:53:30     at emitNone (events.js:91:20)
Sep 28 17:53:30     at IncomingMessage.emit (events.js:185:7)
Sep 28 17:53:30     at endReadableNT (_stream_readable.js:974:12)
Sep 28 17:53:30     at _combinedTickCallback (internal/process/next_tick.js:74:11)
Sep 28 17:53:30     at process._tickDomainCallback (internal/process/next_tick.js:122:9)
Sep 28 17:53:30 {"type":"task aborted","source":"top","provisionerId":"aws-provisioner-v1","workerId":"i-075c20f7dc473f632","workerGroup":"us-west-2","workerType":"funsize-balrog","workerNodeType":"m3.xlarge","taskId":"S3S69WtkQ1ekWx5-TC78Eg","runId":0,"exception":"","error":"\n[taskcluster:error] Task was aborted because states could not be created successfully. Error calling 'link' for balrogVPNProxy : Gateway Timeout\n"}
Sep 28 17:53:30 Thu, 28 Sep 2017 22:53:30 GMT docker-worker:states taskId: S3S69WtkQ1ekWx5-TC78Eg at state: killed

Sadly, async stacks (like with Twisted) aren't very useful, so it's hard to tell who called Request there, but balrog_vpn_proxy.js is a good guess, since it tries to connect to Balrog. Here's a hypothesis: nginx is up, but the OpenVPN connection is not yet up when this request comes in, so nginx replies with a gateway error (504 Gateway Timeout here). Maybe just catching that error and retrying would be the right choice? I'm just confused why this shows up as a failure calling "link" -- typically we see that message when the container itself crashes, which does not appear to be the case here. Wander, what do you think?
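If that hypothesis is right, something along these lines in the feature's startup path would paper over the race. This is only a sketch, assuming superagent and made-up attempt/delay numbers; it is not the actual balrog_vpn_proxy.js code:

    // Treat gateway-style responses from the proxy as "not up yet" and retry
    // with a short backoff before giving up. Illustration only.
    const request = require('superagent');

    async function waitForProxy(url, attempts = 10, delayMs = 5000) {
      for (let i = 0; i < attempts; i++) {
        try {
          await request.get(url);
          return; // proxy answered, safe to link the container
        } catch (err) {
          const status = err.status || (err.response && err.response.status);
          if (![502, 503, 504].includes(status) || i === attempts - 1) {
            throw err; // a real failure, or we're out of retries
          }
          await new Promise(resolve => setTimeout(resolve, delayMs));
        }
      }
    }

Retrying only on gateway-style statuses keeps genuine failures fast-failing while giving the OpenVPN tunnel some time to come up.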
Flags: needinfo?(wcosta)
Flags: needinfo?(wcosta)
nthomas: I haven't noticed this lately. Have we hit this in a while? Think we can close given comment 12?
Flags: needinfo?(nthomas)
I haven't seen it either, yay for fixes.
Status: REOPENED → RESOLVED
Closed: 7 years ago
Flags: needinfo?(nthomas)
Resolution: --- → FIXED
Whiteboard: [releaseduty]