Closed Bug 1583236 Opened 4 months ago Closed 4 months ago

level-1 GCP workers can't contact the "dev" external instances (bouncer, balrog, shipit) from GCP "prod" environment

Categories

(Release Engineering :: Release Automation: Bouncer, defect)

Tracking

(Not tracked)

RESOLVED FIXED

People

(Reporter: mtabara, Assigned: mtabara)

References

(Blocks 1 open bug, Regression)

Details

We turned on GCP bouncerworkers and that worked smoothly on production. But in a staging release that I triggered earlier today, I noticed the job is starving in the pending state - https://tools.taskcluster.net/groups/N1Jojup1QT-6UotcPUzrwg/tasks/HVXNTIdmTNSda5IJxL3EAA/details

We may have gotten some configuration wrong in cloudops-infra or elsewhere. I'm investigating further.

Turns out there are some beetmover workers too that were pending. Something's fishy ... see Exception tasks under https://tools.taskcluster.net/groups/N1Jojup1QT-6UotcPUzrwg

Okay, so beetmover-source-firefox-source/opt ended up with deadline exceeded because its dep in signing failed. Since we don't have GCP signing rolled out yet, it used depsigning/gecko-level-1 workers, so we can ignore this one as it's expected. Because of it, post-beetmover-dummy didn't go green, which means release-generate-checksums starved until expiration. Eventually release-notify-promote-firefox failed too for the same reason.

So gecko-1-beetmover is fine, and we can go back to debugging gecko-1-bouncer.
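To make the cascade concrete, here is a minimal sketch of how one failed task starves everything downstream of it. The task names come from this comment, but the edges are a simplified assumption (the real task graph has many more nodes), "signing" is a placeholder label for the failed signing dependency, and `blocked_tasks` is a hypothetical helper, not part of Taskcluster.

```python
from collections import deque

# Assumed, simplified edges: task -> tasks that depend on it.
DEPENDENTS = {
    "signing": ["beetmover-source-firefox-source/opt"],
    "beetmover-source-firefox-source/opt": ["post-beetmover-dummy"],
    "post-beetmover-dummy": ["release-generate-checksums"],
    "release-generate-checksums": ["release-notify-promote-firefox"],
}


def blocked_tasks(failed, dependents):
    """Return every task downstream of `failed`, in breadth-first order.

    These are the tasks that can never be scheduled once `failed` fails,
    so they sit unscheduled until their deadline expires.
    """
    blocked = []
    visited = {failed}
    queue = deque([failed])
    while queue:
        task = queue.popleft()
        for child in dependents.get(task, []):
            if child not in visited:
                visited.add(child)
                blocked.append(child)
                queue.append(child)
    return blocked


for name in blocked_tasks("signing", DEPENDENTS):
    print("blocked:", name)
```

Under these assumed edges, none of the blocked tasks points at a problem with its own worker type, which is why the beetmover exception here can be set aside.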

The fact that we don't actually have gecko-1-bouncer under https://tools.taskcluster.net/provisioners/scriptworker-k8s/worker-types suggests we missed adding the corresponding configs somewhere too ...

Managed to figure things out after inspecting the actual console logs of this worker. It was failing because of missing scopes in https://github.com/mozilla-services/cloudops-infra/blob/master/projects/relengworker/k8s/values/bouncer.yaml#L14. We were using the old structure of gecko-t-bouncer instead of gecko-1-bouncer. I've adjusted the scopes accordingly and now we're claiming tasks successfully.
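For anyone hitting the same symptom, here is a minimal sketch of the kind of check that would have flagged the mismatch. It relies on the Taskcluster rule that a granted scope covers a required one when they are equal, or when the granted scope ends in `*` and the required scope starts with the part before the `*`. The scope strings below are illustrative assumptions, not the actual contents of bouncer.yaml, and both helper functions are hypothetical.

```python
def scope_satisfied(required, granted):
    """True if any granted scope covers `required`.

    A granted scope covers it on exact match, or as a prefix when the
    granted scope ends with '*'.
    """
    for scope in granted:
        if scope == required:
            return True
        if scope.endswith("*") and required.startswith(scope[:-1]):
            return True
    return False


def missing_scopes(required_scopes, granted):
    """Return the required scopes that no granted scope covers."""
    return [s for s in required_scopes if not scope_satisfied(s, granted)]


# Illustrative values only: old gecko-t-* naming vs. the gecko-1-* worker type.
granted = ["queue:claim-work:scriptworker-k8s/gecko-t-bouncer"]
required = ["queue:claim-work:scriptworker-k8s/gecko-1-bouncer"]

print(missing_scopes(required, granted))
```

With the old gecko-t-bouncer naming, the gecko-1-bouncer scope comes back as missing, which would match a worker that is unable to claim tasks.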

I'm now hitting netflows (I believe?) issues, as we can't talk to either the dev bouncer instance or the nazgul counterpart. I think that's because we only have netflows to access production instances (Balrog, Bouncer, etc) from the PROD environment, while dev instances can only be accessed from the NON-PROD environment. We need to open access to the DEV instances as well for all the gecko-1-{script} communication.

https://tools.taskcluster.net/groups/U2bxpgMlTjGbywI-xGCoeg/tasks/B4HQUi8YTR2AKwR2XNm6Yw/details
https://tools.taskcluster.net/groups/U2bxpgMlTjGbywI-xGCoeg/tasks/fV5H0BGHSQOLFEWOOq_7EA/details
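A quick way to tell a netflow problem apart from an application error is to attempt a bare TCP connection from inside the worker's environment. The sketch below is only that, a sketch: `can_connect` and `probe_all` are hypothetical helpers, and the hostnames in the usage note are placeholders, since the real dev endpoints are not listed in this bug.

```python
import socket


def can_connect(host, port, timeout=5.0):
    """Return True if a TCP connection to host:port succeeds within `timeout`.

    Timeouts, refused connections, and DNS failures are all subclasses of
    OSError, so a single except clause covers them.
    """
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        return False


def probe_all(targets, timeout=5.0):
    """Map each (host, port) pair to whether it was reachable."""
    return {target: can_connect(target[0], target[1], timeout) for target in targets}
```

Calling `probe_all([("bouncer-dev.example.invalid", 443), ("balrog-dev.example.invalid", 443)])` with the real dev hostnames substituted, once from the prod environment and once from non-prod, would show whether reachability depends on the environment. A missing netflow often shows up as a timeout rather than an HTTP error, but that depends on how the firewall is configured.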

(In reply to Mihai Tabara [:mtabara]⌚️GMT from comment #6)

I'm now hitting netflows (I believe?) issues, as we can't talk to either the dev bouncer instance or the nazgul counterpart. I think that's because we only have netflows to access production instances (Balrog, Bouncer, etc) from the PROD environment, while dev instances can only be accessed from the NON-PROD environment. We need to open access to the DEV instances as well for all the gecko-1-{script} communication.

https://tools.taskcluster.net/groups/U2bxpgMlTjGbywI-xGCoeg/tasks/B4HQUi8YTR2AKwR2XNm6Yw/details
https://tools.taskcluster.net/groups/U2bxpgMlTjGbywI-xGCoeg/tasks/fV5H0BGHSQOLFEWOOq_7EA/details

Dropping an open question for this to @oremj, as I'm not sure my supposition is right.

Flags: needinfo?(oremj)
Summary: level-1 GCP bouncerworkers are starving in `pending` mode → level-1 GCP bouncerworkers can't be contacted from GCP "prod" environment
Summary: level-1 GCP bouncerworkers can't be contacted from GCP "prod" environment → level-1 GCP workers can't contact the "dev" external instances (bouncer, balrog, shipit) from GCP "prod" environment

Talked to CloudOps today.
Short-term plan: enable access from the prod environment to the dev instances
Long-term plan: spawn new fakeprod instances of the external resources (bouncer, balrog, shipit, etc)

I'll file a separate bug to track the longer term plan.

Flags: needinfo?(oremj)
Blocks: 1533337

Bouncer level-1 in try staging release going green - https://tools.taskcluster.net/groups/CzQQTMmJQ_a8h1vPO2ZPcg/tasks/C3t0ccE5RrejuUmmLDv8tA/details

Balrog level-1 in try staging release going green - https://tools.taskcluster.net/groups/CzQQTMmJQ_a8h1vPO2ZPcg/tasks/djPi_VaVQm2BVUPCQ69W4A/details

Thanks for opening the netflows :oremj!

Closing this; we'll talk more about the real long-term solution in bug 1533337.

Status: NEW → RESOLVED
Closed: 4 months ago
Resolution: --- → FIXED