Closed Bug 1187937 Opened 4 years ago Closed 4 years ago

docker-worker fails to initialize with "Gateway Time-out"

Categories

(Taskcluster :: Workers, defect)


Tracking

(Not tracked)

RESOLVED FIXED

People

(Reporter: rail, Assigned: garndt)

References

Details

(Whiteboard: [balrog-vpn-proxy])

Attachments

(1 file)

Similar to bug 1187934, but with a different signature.

Sounds like it tries to initialize the link but fails. This should probably be reported as an infra error, so that Taskcluster reschedules the task.

[taskcluster] taskId: IiFbGm0zS8qWPAlM44d7oA, workerId: i-46838995

[taskcluster] Error: Task was aborted because states could not be created successfully. Error: Error: Gateway Time-out
[taskcluster] Unsuccessful task run with exit code: -1 completed in 923.868 seconds
Depends on: 1187960
No longer depends on: 1187960
This is the #1 failure for funsize now: tens of failures a day, see https://treeherder.allizom.org/#/jobs?repo=mozilla-aurora&exclusion_profile=false&filter-searchStr=funsize

Any suggestions on what I should tackle, and how, to make this retry properly?
I'm going to ni? myself on this so I can check it out for you.
Flags: needinfo?(garndt)
Flags: needinfo?(garndt)
Whiteboard: [balrog-vpn-proxy]
Now we're starting to use these for releases! \o/
Greg, what do you think about bumping the default 60s timeout for the following commands to something like 120s?

https://github.com/taskcluster/taskcluster-vpn-proxy/compare/master...rail:gateway_timeout?expand=1
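The linked branch just raises the proxy's per-request timeout. As a rough illustration of the broader idea discussed in this bug (treating a 504 Gateway Time-out as a transient infra error and retrying with a longer per-attempt timeout), here is a minimal Python sketch; the helper name and signature are hypothetical, not taskcluster-vpn-proxy code:

```python
import time

GATEWAY_TIMEOUT = 504

def fetch_with_retries(do_request, timeout=120, retries=3, backoff=2.0, sleep=time.sleep):
    """Call do_request(timeout) and retry on 504 Gateway Time-out.

    do_request is any callable returning (status_code, body). A 504 is
    treated as a transient infra error and retried with exponential
    backoff; any other status is returned immediately.
    """
    delay = 1.0
    for attempt in range(1, retries + 1):
        status, body = do_request(timeout)
        if status != GATEWAY_TIMEOUT:
            return status, body
        if attempt < retries:
            sleep(delay)
            delay *= backoff
    return status, body
```

The key design point is that only the 504 is considered retryable; a genuine task failure should still surface as a failure rather than being retried indefinitely.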
Flags: needinfo?(garndt)
I'm definitely up for it.  If you want to open up the PR, I could get it merged and a new proxy image deployed and used by those workers
Flags: needinfo?(garndt)
Attached file Increase timeouts
(In reply to Greg Arndt [:garndt] from comment #5)
> I'm definitely up for it.  If you want to open up the PR, I could get it
> merged and a new proxy image deployed and used by those workers

Let's do it! :)
Attachment #8731686 - Flags: review?(garndt)
Attachment #8731686 - Flags: review?(garndt) → review+
This has been merged, but note the following steps I still need to complete; given some other things going on, and it being late in the week, this might take me a little time:

1. Build the VPN proxy image
2. Update docker-worker to reference that new version
3. Deploy the new docker-worker to those worker types
There is no rush. Feel free to postpone this to next week.
You're the best, thank you!
Assignee: nobody → garndt
I'll assign this to myself for the next pieces that need to happen, and I'll update this bug.
Status: NEW → ASSIGNED
New AMIs should be rolling out, please let me know if you see any issues.
Status: ASSIGNED → RESOLVED
Closed: 4 years ago
Resolution: --- → FIXED
woot! Thanks a lot!
Seen this once today. I wonder if we should schedule a rerun in this case...
It doesn't look like increasing the timeout resolved the issue. https://tools.taskcluster.net/task-inspector/#KjPszsj5RoezXmL4yVgc1A/0

Maybe we should look at abortRun() and just rerun until we reach `retries`?
Status: RESOLVED → REOPENED
Resolution: FIXED → ---
You should be able to set the retry number when submitting the graph of these tasks, and if a task fails, it will be retried up to that number of times. Maybe that's enough to work around this? I believe we are moving away from requiring the VPN at all.
That's how many times a task will be retried if an exception was reported (worker-shutdown, for instance). These tasks are currently marked as "failed", and the only way they get retried is by specifying "reruns" when submitting the task graph.

This is listed in the request payload for a task node here: http://docs.taskcluster.net/scheduler/api-docs/#createTaskGraph
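To make the reruns/retries distinction concrete, here is a hedged sketch of a task-graph node carrying a `reruns` count. The field names follow the createTaskGraph request payload linked above, but the helper and the task definition contents are illustrative, not a verified schema:

```python
def make_task_node(task_id, task_def, reruns=5):
    """Wrap a task definition in a task-graph node.

    `reruns` (assumed per the scheduler docs linked above) is how many
    times the scheduler re-submits the task when a run ends as "failed";
    it is distinct from `retries`, which only covers exceptions such as
    worker-shutdown.
    """
    return {
        "taskId": task_id,
        "requires": [],      # upstream task IDs this node depends on
        "reruns": reruns,    # rerun a *failed* task up to this many times
        "task": task_def,    # the ordinary task definition
    }

# Hypothetical usage: a graph with a single funsize-style task.
graph = {
    "tasks": [
        make_task_node("exampleTaskId123", {"provisionerId": "aws-provisioner-v1"}),
    ],
}
```

This is the workaround rail settles on below: rather than teaching the worker to retry on the gateway timeout, the task graph itself allows failed runs to be rerun.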
Oh, sorry, I mixed up reruns and retries. Let's close this bug; I'll address it by specifying reruns.
Status: REOPENED → RESOLVED
Closed: 4 years ago
Resolution: --- → FIXED
See Also: → 1259566
Great to hear that we were able to work around this issue. But shouldn't we still investigate what's actually broken here? It could still be that reruns are broken because of that problem.
Balrog VPN proxy will be dead soon; we are going to use balrog workers instead. IMO it's not worth investing more in fixing the current approach.
Component: Docker-Worker → Workers