Bug 1187937 (Closed)
docker-worker fails to initialize with "Gateway Time-out"
Opened 9 years ago • Closed 9 years ago
Categories: Taskcluster :: Workers, defect
Tracking: (Not tracked)
Status: RESOLVED FIXED
People: Reporter: rail; Assigned: garndt
Whiteboard: [balrog-vpn-proxy]
Attachments: 1 file
Similar to Bug #1187934, but with a different signature.
It looks like the worker tries to initialize the link but fails. This should probably be an infra error, so that TC reschedules the task.
[taskcluster] taskId: IiFbGm0zS8qWPAlM44d7oA, workerId: i-46838995
[taskcluster] Error: Task was aborted because states could not be created successfully. Error: Error: Gateway Time-out
[taskcluster] Unsuccessful task run with exit code: -1 completed in 923.868 seconds
Comment 1 • 9 years ago (Reporter)
This is the #1 failure for funsize now, tens per day; see https://treeherder.allizom.org/#/jobs?repo=mozilla-aurora&exclusion_profile=false&filter-searchStr=funsize
Any suggestions on what I should tackle, and how, to make this retry better?
Comment 2 • 9 years ago (Assignee)
I'm going to ni? myself on this so I can check it out for you.
Flags: needinfo?(garndt)
Updated • 9 years ago (Assignee)
Flags: needinfo?(garndt)
Whiteboard: [balrog-vpn-proxy]
Comment 3 • 9 years ago (Reporter)
Now we're starting to use these for releases! \o/
Blocks: release-promotion
Comment 4 • 9 years ago (Reporter)
Greg, what do you think about bumping the default 60s timeout for the following commands to something like 120s?
https://github.com/taskcluster/taskcluster-vpn-proxy/compare/master...rail:gateway_timeout?expand=1
Flags: needinfo?(garndt)
Comment 5 • 9 years ago (Assignee)
I'm definitely up for it. If you want to open up the PR, I can get it merged and a new proxy image deployed and used by those workers.
Flags: needinfo?(garndt)
Comment 6 • 9 years ago (Reporter)
(In reply to Greg Arndt [:garndt] from comment #5)
> I'm definitely up for it. If you want to open up the PR, I could get it
> merged and a new proxy image deployed and used by those workers
Let's do it! :)
Attachment #8731686 - Flags: review?(garndt)
Updated • 9 years ago (Assignee)
Attachment #8731686 - Flags: review?(garndt) → review+
Comment 7 • 9 years ago (Assignee)
This has been merged, but note the following steps I still need to complete; given some other things going on and it being late in the week, they might take me a little time:
1. Build the VPN proxy image
2. Update docker-worker to reference that new version
3. Deploy the new docker-worker to those worker types
Comment 8 • 9 years ago (Reporter)
There is no rush. Feel free to postpone this to next week.
Comment 9 • 9 years ago (Assignee)
You're the best, thank you!
Updated • 9 years ago (Assignee)
Assignee: nobody → garndt
Comment 10 • 9 years ago (Assignee)
I'll assign this to myself for the next pieces that need to happen, and I'll update this bug.
Status: NEW → ASSIGNED
Comment 11 • 9 years ago (Assignee)
New AMIs should be rolling out; please let me know if you see any issues.
Status: ASSIGNED → RESOLVED
Closed: 9 years ago
Resolution: --- → FIXED
Comment 12 • 9 years ago (Reporter)
woot! Thanks a lot!
Comment 13 • 9 years ago (Reporter)
I've seen this once today. I wonder if we should schedule a rerun in this case...
Comment 14 • 9 years ago (Reporter)
It doesn't look like increasing the timeout resolves the issue: https://tools.taskcluster.net/task-inspector/#KjPszsj5RoezXmL4yVgc1A/0
Maybe we should look at abortRun() and just rerun until we reach `retries`?
Status: RESOLVED → REOPENED
Resolution: FIXED → ---
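A minimal sketch of the retry idea floated above (this is not docker-worker's actual abortRun() implementation; `withRetries` is a hypothetical helper): wrap the failing state-initialization step so a transient "Gateway Time-out" is retried up to a retry budget instead of failing the task outright.

```javascript
// Retry an async step up to `retries` attempts, rethrowing the last
// error only once the budget is exhausted.
async function withRetries(fn, retries) {
  let lastErr;
  for (let attempt = 1; attempt <= retries; attempt++) {
    try {
      return await fn();
    } catch (err) {
      lastErr = err; // e.g. "Gateway Time-out" from the VPN proxy
    }
  }
  throw lastErr;
}
```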
Comment 15 • 9 years ago (Assignee)
You should be able to set the retry number when submitting the graph of these tasks, and if a task fails, it will retry up to that number of times. Maybe that's enough to work around this? I believe we are moving away from requiring the VPN at all.
Comment 16 • 9 years ago (Reporter)
I think it's already set ("retries": 5); see https://queue.taskcluster.net/v1/task/KjPszsj5RoezXmL4yVgc1A
Comment 17 • 9 years ago (Assignee)
That's how many times a task will be retried if an exception was reported (worker-shutdown, for instance). These tasks are currently marked as "failed", and the only way they get retried is by specifying "reruns" when submitting the task graph.
This is listed in the request payload for a task node here: http://docs.taskcluster.net/scheduler/api-docs/#createTaskGraph
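To make the retries/reruns distinction concrete, here is a hedged sketch of the shape of a task-graph node (the taskId placeholder and the task body are illustrative, not copied from funsize's real submission): `reruns` lives on the graph node and covers "failed" runs, while `task.retries` only covers exception runs such as worker-shutdown.

```javascript
// Hypothetical createTaskGraph payload fragment; see the scheduler docs
// linked above for the authoritative schema.
const taskGraph = {
  tasks: [{
    taskId: 'placeholder-task-id',
    requires: [],
    reruns: 5, // re-run this task up to 5 times if a run ends as "failed"
    task: {
      // ...usual task definition elided...
      retries: 5, // only covers exceptions, e.g. worker-shutdown
    },
  }],
};
```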
Comment 18 • 9 years ago (Reporter)
Oh, sorry, I mixed up reruns and retries. Let's close this bug; I'll address this by specifying reruns.
Status: REOPENED → RESOLVED
Closed: 9 years ago → 9 years ago
Resolution: --- → FIXED
Comment 19 • 9 years ago (Reporter)
Yup, setting "reruns" solves the issue: https://tools.taskcluster.net/task-inspector/#Fd9OCbp_QjerTJchTBR1CA/0
Comment 20 • 9 years ago
Great to hear that we were able to work around this issue. But shouldn't we still investigate what's actually broken here? It could still be that reruns break because of the same underlying problem.
Comment 21 • 9 years ago (Reporter)
The Balrog VPN proxy will be dead soon; we are going to use balrog workers instead. IMO it's not worth investing more in fixing the current approach.
Updated • 6 years ago
Component: Docker-Worker → Workers