Bug 1588392 (closed) - opened last month, closed 27 days ago

Switch to new GCP treescript workers

Categories

(Release Engineering :: Release Automation: Other, task)


Tracking

(firefox71 fixed)

RESOLVED FIXED

People

(Reporter: nthomas, Assigned: nthomas)

References

(Blocks 1 open bug)

Details

Attachments

(2 files)

Release-only worker:

  • tagging at the start of releases via release-early-tagging
  • tagging and version bump via release-early-tagging

Will need to review instance sizing since this needs to clone gecko.

gecko-1-tree can do early tagging OK - e.g. https://tools.taskcluster.net/groups/SpyLLQRFRCyQij-XGkio_A/tasks/L268SOv0TVmYnoSpOEk4Gg

  • Initial clone sets up the hg share and takes 23 minutes (AWS unknown, we never do this with a static host)
  • subsequent runs are 4m 40s (AWS ~ 3m 30s)
  • that's for a CPU request of 1000m, memory request 4000M
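In Kubernetes terms, a request/limit pair like the one quoted here (requests of 1000m CPU / 4000M memory, with the somewhat higher limits of 1200m / 4500M visible in the utilisation graphs) would look roughly like this in the pod spec. This is a sketch only - the container and image names are hypothetical, since the actual worker deployment manifest isn't attached to this bug:

```yaml
# Sketch: real manifest not shown in this bug; names are hypothetical.
apiVersion: v1
kind: Pod
metadata:
  name: treescript-worker        # hypothetical name
spec:
  containers:
    - name: treescript
      image: example/treescript:latest   # hypothetical image
      resources:
        requests:                # what the scheduler reserves on the node
          cpu: "1000m"
          memory: "4000M"
        limits:                  # burst ceiling; requests < limits = Burstable QoS
          cpu: "1200m"
          memory: "4500M"
```

Because the limits exceed the requests, the pod can burst above its reservation when the node has spare capacity, which matches the graphs showing usage between the two figures.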

(In reply to Nick Thomas [:nthomas] (UTC+12) from comment #1)

gecko-1-tree can do early tagging ok - eg https://tools.taskcluster.net/groups/SpyLLQRFRCyQij-XGkio_A/tasks/L268SOv0TVmYnoSpOEk4Gg

  • Initial clone sets up the hg share and takes 23 minutes (AWS unknown, we never do this with a static host)
  • subsequent runs are 4m 40s (AWS ~ 3m 30s)
  • that's for a CPU request of 1000m, memory request 4000M

Beefy instances :) It seems in AWS we're using a t2.medium for both treescriptworker1 and treescriptworker-dev1; per the AWS docs that's 2 vCPUs and 4 GB of RAM, so we'd need the same here. We had this problem with signingworkers as well, where we kept bumping the memory and CPUs until we closed the runtime gap with the AWS counterparts.

Luckily we only need one instance here, no? (In the AWS world we only had one.)
Or is it worth allocating two?

Attached image Utilisation graphs

The first early tag is at about 16:00, and includes a clone. Two reruns at ~17:00 use the hg share. You can see we have higher limits than requests - 1200m and 4500M - and that headroom does get used.

The node itself has 2 vCPUs and 7.5 GB, so we're heading toward single occupancy if we go much higher on the request. Raising limits might help a bit. Mercurial, being a Python app, will only use a single CPU, so the extra headroom would mainly absorb OS load.

Overall, maybe it's not a big deal given these are leaf tasks; slower runs would just make the ship graph a little longer.

The creds should be good on gecko-3-tree - here's a maple early tagging from a week ago:
https://tools.taskcluster.net/groups/GM-nVD0MQ2W36kyRRRh8rA/tasks/c0VigaWARJyfaV2dYmIEGw/details

Pushed by nthomas@mozilla.com:
https://hg.mozilla.org/integration/autoland/rev/d41b80604a69
Switch to new GCP treescript workers, r=mtabara
Status: ASSIGNED → RESOLVED
Closed: 27 days ago
Resolution: --- → FIXED

I don't think we need any autoscale patch to match the one worker we have in AWS.

We could consider autoscaling between 0 and 1, given we only need treescript twice per release, which works out to 10 × 5 min = 50 minutes per week for most of the beta cycle. One downside of having no instance live is that the first job will spend 20-25 minutes doing the initial clone of the hg share.
Aki and I speculated that we could pre-populate the hg share during the image build, and then only need to pull the delta since then. Some of the time won that way may be eroded by the larger image increasing compression time, as well as transfer to Docker Hub and into the k8s cluster.
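A minimal sketch of the pre-population idea, assuming a Debian base image and an `HG_SHARE_BASE_DIR`-style share layout (the real image build files aren't part of this bug, so the paths, base image, and the choice of mozilla-unified are all assumptions):

```dockerfile
# Sketch only: base image, paths, and repo choice are assumptions.
FROM debian:stable-slim
RUN apt-get update && apt-get install -y --no-install-recommends \
        mercurial ca-certificates && \
    rm -rf /var/lib/apt/lists/*
ENV HG_SHARE_BASE_DIR=/builds/hg-shared
# Seed a bare clone into the share directory at image-build time;
# --noupdate skips the working copy to keep the layer smaller.
RUN mkdir -p "$HG_SHARE_BASE_DIR" && \
    hg clone --noupdate https://hg.mozilla.org/mozilla-unified \
        "$HG_SHARE_BASE_DIR/mozilla-unified"
```

At task time the worker would then only pull changesets landed since the image was built, at the cost of the multi-gigabyte clone being baked into every image layer push and pull.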

Kubernetes also has Volume and PersistentVolume support for sharing data between containers, but we'd want to make sure the backing store is fast because hg does so much I/O (i.e. not the slow AWS EBS block store we had on workers a while back). I'm also not sure whether that's compatible with more than one container (race conditions in the share), or how the storage vs. GCE costs work out.
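For reference, a persistent volume claim for the hg share might look something like the sketch below. The name, storage class, and size are assumptions, not anything deployed; note that a `ReadWriteOnce` access mode would sidestep the multi-container race concern by limiting the volume to one node at a time:

```yaml
# Sketch: names, class, and size are hypothetical.
apiVersion: v1
kind: PersistentVolumeClaim
metadata:
  name: hg-share            # hypothetical name
spec:
  accessModes:
    - ReadWriteOnce         # mounted by one node at a time
  storageClassName: ssd     # assumed class backed by pd-ssd, for hg's I/O
  resources:
    requests:
      storage: 20Gi         # rough guess at the gecko share size
```

Whether SSD-backed persistent disk is cheaper than simply paying the clone time on a fresh pod is the storage-vs-GCE cost question raised above.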

At some point l10n-bumper will move into treescript and run jobs every hour.

The gecko change was uplifted to beta by the sheriffs:
https://hg.mozilla.org/releases/mozilla-beta/rev/d41b80604a6952fcf20d08250b60ce065223d24b

This wasn't included in 71.0b2 but will be in b3.
