Closed Bug 1187937 Opened 4 years ago Closed 4 years ago

docker-worker fails to initialize with "Gateway Time-out"

Categories

(Taskcluster :: Workers, defect)


Tracking

(Not tracked)

RESOLVED FIXED

People

(Reporter: rail, Assigned: garndt)

References

Details

(Whiteboard: [balrog-vpn-proxy])

Attachments

(1 file)

Similar to bug 1187934, but with a different signature.

Sounds like it tries to initialize the link but fails. This should probably be reported as an infra error, so that Taskcluster reschedules the task.

[taskcluster] taskId: IiFbGm0zS8qWPAlM44d7oA, workerId: i-46838995

[taskcluster] Error: Task was aborted because states could not be created successfully. Error: Error: Gateway Time-out
[taskcluster] Unsuccessful task run with exit code: -1 completed in 923.868 seconds
Depends on: 1187960
No longer depends on: 1187960
This is the #1 failure for funsize now: tens of failures a day, see https://treeherder.allizom.org/#/jobs?repo=mozilla-aurora&exclusion_profile=false&filter-searchStr=funsize

Any suggestions on what I should tackle, and how, to make this retry properly?
I'm going to ni? myself on this so I can check it out for you.
Flags: needinfo?(garndt)
Flags: needinfo?(garndt)
Whiteboard: [balrog-vpn-proxy]
Now we're starting to use these for releases! \o/
Greg, what do you think about bumping the default 60s timeout for the following commands to something like 120s?

https://github.com/taskcluster/taskcluster-vpn-proxy/compare/master...rail:gateway_timeout?expand=1
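The linked branch just raises the proxy's per-request timeout. As a rough illustration of the broader idea discussed in this bug (treating a 504 Gateway Time-out as a transient infra error and retrying with a longer per-attempt timeout), here is a minimal Python sketch; the helper name and signature are hypothetical, not taskcluster-vpn-proxy code:

```python
import time

GATEWAY_TIMEOUT = 504

def fetch_with_retries(do_request, timeout=120, retries=3, backoff=2.0, sleep=time.sleep):
    """Call do_request(timeout) and retry on 504 Gateway Time-out.

    do_request is any callable returning (status_code, body). A 504 is
    treated as a transient infra error and retried with exponential
    backoff; any other status is returned immediately.
    """
    delay = 1.0
    for attempt in range(1, retries + 1):
        status, body = do_request(timeout)
        if status != GATEWAY_TIMEOUT:
            return status, body
        if attempt < retries:
            sleep(delay)
            delay *= backoff
    return status, body
```

The key design point is that only the 504 is considered retryable; a genuine task failure should still surface as a failure rather than being retried indefinitely.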
Flags: needinfo?(garndt)
I'm definitely up for it.  If you want to open up the PR, I could get it merged and a new proxy image deployed and used by those workers
Flags: needinfo?(garndt)
Attached file Increase timeouts
(In reply to Greg Arndt [:garndt] from comment #5)
> I'm definitely up for it.  If you want to open up the PR, I could get it
> merged and a new proxy image deployed and used by those workers

Let's do it! :)
Attachment #8731686 - Flags: review?(garndt)
Attachment #8731686 - Flags: review?(garndt) → review+
This has been merged, but note the following steps I still need to complete; given some other things going on, and it being late in the week, this might take me a little time:

1. Build the VPN proxy image
2. Update docker-worker to reference that new version
3. Deploy the new docker-worker to those worker types
There is no rush. Feel free to postpone this to next week.
You're the best, thank you!
Assignee: nobody → garndt
I'll assign this to myself for the next pieces that need to happen, and I'll update this bug.
Status: NEW → ASSIGNED
New AMIs should be rolling out, please let me know if you see any issues.
Status: ASSIGNED → RESOLVED
Closed: 4 years ago
Resolution: --- → FIXED
woot! Thanks a lot!
Seen this once today. I wonder if we should schedule a rerun in this case...
It doesn't look like increasing the timeout resolved the issue. https://tools.taskcluster.net/task-inspector/#KjPszsj5RoezXmL4yVgc1A/0

Maybe we should look at abortRun() and just rerun until we reach `retries`?
Status: RESOLVED → REOPENED
Resolution: FIXED → ---
You should be able to set the retry number when submitting the graph of these tasks, and if a task fails, it will be retried up to that number of times. Maybe that's enough to work around this? I believe we are moving away from requiring the VPN at all.
That's how many times a task will be retried if an exception was reported (worker-shutdown, for instance). These tasks are currently marked as "failed", and the only way they get retried is by specifying "reruns" when submitting the task graph.

This is listed in the request payload for a task node here: http://docs.taskcluster.net/scheduler/api-docs/#createTaskGraph
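To make the reruns/retries distinction concrete, here is a hedged sketch of a task-graph node carrying a `reruns` count. The field names follow the createTaskGraph request payload linked above, but the helper and the task definition contents are illustrative, not a verified schema:

```python
def make_task_node(task_id, task_def, reruns=5):
    """Wrap a task definition in a task-graph node.

    `reruns` (assumed per the scheduler docs linked above) is how many
    times the scheduler re-submits the task when a run ends as "failed";
    it is distinct from `retries`, which only covers exceptions such as
    worker-shutdown.
    """
    return {
        "taskId": task_id,
        "requires": [],      # upstream task IDs this node depends on
        "reruns": reruns,    # rerun a *failed* task up to this many times
        "task": task_def,    # the ordinary task definition
    }

# Hypothetical usage: a graph with a single funsize-style task.
graph = {
    "tasks": [
        make_task_node("exampleTaskId123", {"provisionerId": "aws-provisioner-v1"}),
    ],
}
```

This is the workaround rail settles on below: rather than teaching the worker to retry on the gateway timeout, the task graph itself allows failed runs to be rerun.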
Oh, sorry, I mixed up reruns and retries. Let's close this bug; I'll address it by specifying reruns.
Status: REOPENED → RESOLVED
Closed: 4 years ago
Resolution: --- → FIXED
See Also: → 1259566
Great to hear that we were able to work around this issue. But shouldn't we still investigate what's actually broken here? It could still be that reruns are broken because of that problem.
Balrog VPN proxy will be dead soon; we are going to use balrog workers instead. IMO it's not worth investing more in fixing the current approach.
Component: Docker-Worker → Workers