Closed Bug 1580478 Opened 6 years ago Closed 6 years ago

Switch to new GCP balrogworkers

Categories

(Release Engineering :: Release Automation, task)

task
Not set
normal

Tracking

(firefox71 fixed)

RESOLVED FIXED
Tracking Status
firefox71 --- fixed

People

(Reporter: mtabara, Assigned: mtabara)

References

Details

Attachments

(2 files)

+++ This bug was initially created as a clone of Bug #1580476 +++

+++ This bug was initially created as a clone of Bug #1579476 +++

Rail gave me a tour in the new world today. It looks like we can switch to the new balrog workers hosted by cloudops at some point next week \o/

Still to investigate whether netflows are fine or not, but once that's confirmed, we can start with nightlies.

Blocks: 1533337
No longer depends on: 1580476, 1579476
Assignee: nobody → mtabara

Brought all branches up to date with master in balrogscript and pushed a test patch to try. Most build jobs failed for some reason, so I won't be able to see all the locale balrog jobs as I had hoped. Nevertheless, the balrog-toplevel-submit job made it through the new GCP worker - https://tools.taskcluster.net/groups/IDqFsAmWTNi4jUtWi1aFLg/tasks/AJ5TtCMZRlGtnghMW-JB0g/details

Prepping a patch to auto-scale both environments, plus some measurements to understand how much capacity we need.
For a regular beta (including partner after >= b8), we have ~1200 beetmover jobs but only ~500+ balrog ones. So balrog definitely needs fewer resources in GCP.

Old AWS puppet-based infrastructure:

  • beetmoverworkers -> 22 in production, 10 in dev
  • balrogworkers -> 10 in production, 10 in dev

GCP-based infrastructure:

  • beetmoverworkers -> max 20 replicas
  • balrogworkers -> I'm thinking of max 10 to begin with, to mirror what we currently have?

Average job runtime:

  • beetmover -> ~50 seconds in a beta (in cloudops-infra this is set to 120 seconds)
  • balrog -> ~20 seconds in a beta (will set this in cloudops to 60 seconds to cover the upper limit)
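
As a rough sanity check on those numbers, here is a minimal Python sketch. It assumes all jobs of a release are available at once, every job takes the average runtime, and workers never sit idle, none of which strictly holds in practice. The function names are made up for illustration; the job counts, runtimes, and worker counts are the approximate figures quoted above.

```python
import math


def drain_time_seconds(jobs: int, avg_runtime_s: float, workers: int) -> float:
    """Estimate wall-clock time to drain a batch of jobs.

    Simplifying assumption: jobs run in full "waves" of `workers` jobs,
    each wave taking `avg_runtime_s`.
    """
    if workers <= 0:
        raise ValueError("workers must be positive")
    waves = math.ceil(jobs / workers)
    return waves * avg_runtime_s


def workers_needed(jobs: int, avg_runtime_s: float, target_s: float) -> int:
    """Estimate how many workers are needed to drain `jobs` within `target_s`."""
    waves_allowed = math.floor(target_s / avg_runtime_s)
    if waves_allowed < 1:
        raise ValueError("target is shorter than a single job")
    return math.ceil(jobs / waves_allowed)


if __name__ == "__main__":
    # Approximate figures from this bug; real releases will differ.
    print("balrog, 10 workers:", drain_time_seconds(500, 20, 10), "s")
    print("beetmover, 20 workers:", drain_time_seconds(1200, 50, 20), "s")
    # The 600 s target is a hypothetical value, picked only for illustration.
    print("balrog workers for a 600 s target:", workers_needed(500, 20, 600))
```

Under these assumptions, ~500 balrog jobs on 10 workers drain in roughly 1000 seconds and ~1200 beetmover jobs on 20 workers in roughly 3000 seconds, which is only a ballpark since jobs actually arrive as upstream tasks finish.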

The try push's balrog toplevel-submit job https://tools.taskcluster.net/groups/b3jfIBkeTh6BqEwarNPh2A/tasks/X4eEsKI3Th-xyzMGjfeLlQ/details exceeded its runtime because no worker picked it up. After double-checking the workers, it turns out the scopes were wrong: they were still polling gecko-t-balrog instead of gecko-1-balrog. We should be good now.
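
To catch this kind of mismatch earlier, a minimal Python sketch of the check could look like the following. The worker names and the shape of the input mapping are hypothetical (this is not how the cloudops deployment exposes its config); only the two worker types, gecko-t-balrog and gecko-1-balrog, come from this bug.

```python
from typing import Dict, List


def find_mismatched_workers(polled_by_worker: Dict[str, str], expected: str) -> List[str]:
    """Return the names of workers polling a worker type other than `expected`."""
    return sorted(
        name for name, worker_type in polled_by_worker.items()
        if worker_type != expected
    )


if __name__ == "__main__":
    # Hypothetical worker names, for illustration only.
    polled = {
        "balrogworker-a": "gecko-t-balrog",
        "balrogworker-b": "gecko-1-balrog",
    }
    stale = find_mismatched_workers(polled, expected="gecko-1-balrog")
    if stale:
        print("workers polling the wrong worker type:", ", ".join(stale))
```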

Staging releases have been behaving very strangely for me lately, with builds failing all along. I pushed to try to see whether it could be related to my patch (very unlikely), and it seems to have kicked off well so far.

I followed up by pushing my patch, plus another push to try.

If this goes green, I'll fire up a staging release based on the latest patch.

Pushed by mtabara@mozilla.com: https://hg.mozilla.org/integration/autoland/rev/6822e913651a enable GCP balrogworkers.r=catlee a=release
Status: NEW → RESOLVED
Closed: 6 years ago
Resolution: --- → FIXED

Note to self: if this morning's nightlies work as expected, I'll uplift this to beta too.

Green tasks on central, but because we have too many workers now (20), we're hitting some conflicts in the blobs which we didn't have before. I'll file a separate bug.

Depends on: 1585321
Component: Release Automation: Updates → Release Automation