Closed Bug 1286824 Opened 8 years ago Closed 8 years ago

balrogVPNProxy is busted after balrog -> cloudops migration

Categories

(Taskcluster :: Workers, defect)

Type: defect
Priority: Not set
Severity: blocker

Tracking

(Not tracked)

RESOLVED FIXED

People

(Reporter: intermittent-bug-filer, Assigned: garndt)

References

Details

(Keywords: intermittent-failure)

garndt, pete: can someone take a look at this? It's very frequent in the update tests :)
Flags: needinfo?(pmoore)
Flags: needinfo?(garndt)
Flags: needinfo?(pmoore)
See Also: → 1187937, 1187592
I'm going to see if I can catch a smoking gun...
15:03 <bhearsum> pmoore: i think https://bugzilla.mozilla.org/show_bug.cgi?id=1286824 is caused by the balrog -> cloudops migration
15:03 <bhearsum> aus4-admin.mozilla.org changed IPs
Updating summary now that we've diagnosed this a bit. It seems there's a hack in docker-worker that hardcodes the aus4-admin.mozilla.org IP address, so when it changed yesterday the proxy stopped working. I've updated that in https://github.com/taskcluster/docker-worker/pull/234, and Taskcluster folks are currently redeploying docker-worker. Hopefully it will work after that's done.
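For illustration only, here is a minimal sketch of the general alternative to a hardcoded address: resolving the hostname at runtime so that an upstream IP change is picked up without a code change and redeploy. This is not what the docker-worker pull request does (that change updates the hardcoded IP), docker-worker is not written in Python, and whether runtime DNS resolution is workable inside the VPN proxy is not known from this bug. The function name `resolve_ipv4`, the constant `BALROG_ADMIN_HOST`, and the default port are hypothetical.

```python
import socket
from typing import Callable, List

# Hostname taken from this bug; the constant name itself is hypothetical.
BALROG_ADMIN_HOST = "aus4-admin.mozilla.org"


def resolve_ipv4(hostname: str, port: int = 443,
                 getaddrinfo: Callable = socket.getaddrinfo) -> List[str]:
    """Return the sorted, de-duplicated IPv4 addresses for hostname.

    The lookup happens at call time, so a changed DNS record is seen on
    the next call rather than being frozen into the code.
    The getaddrinfo parameter exists only so the lookup can be stubbed.
    """
    infos = getaddrinfo(hostname, port, socket.AF_INET, socket.SOCK_STREAM)
    # Each entry is (family, type, proto, canonname, sockaddr);
    # for AF_INET, sockaddr is an (address, port) pair.
    return sorted({info[4][0] for info in infos})
```

A caller would use `resolve_ipv4(BALROG_ADMIN_HOST)` wherever the literal IP was previously written. The trade-off is that the proxy then depends on DNS being reachable and correct at the time of the call.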
Blocks: 1248741
Summary: Intermittent [taskcluster:error] Task was aborted because states could not be created successfully. Error calling 'link' for balrogVPNProxy : Internal Server Error → balrogVPNProxy is busted after balrog -> cloudops migration
Severity: normal → blocker
Code is merged in docker-worker - working on making a new AMI now. Spoke to garndt (woke him up) to get help with this.
Even after the AMI was rebuilt, we were having issues with the VPN proxy connecting to the new production interface. After a ton of debugging (thanks to garndt and mostlygeek for that), we finally realized that the taskcluster-balrog LDAP account was missing an ACL. We looped jabba in, and he got it added to the vpn_balrog group, which got things fixed up.

All of the funsize balrog jobs that have started since the ACL was added have succeeded. Capacity is still low (we stopped new instances from starting after the issue was discovered), but that will come back up over time.
Flags: needinfo?(garndt)
Status: NEW → RESOLVED
Closed: 8 years ago
Resolution: --- → FIXED
Assignee: nobody → garndt
Component: General → Docker-Worker
Component: Docker-Worker → Workers