Bug 1588392 (closed) - opened last month, closed 27 days ago

Switch to new GCP treescript workers

Categories

(Release Engineering :: Release Automation: Other, task)


Tracking

(firefox71 fixed)

RESOLVED FIXED

People

(Reporter: nthomas, Assigned: nthomas)

References

(Blocks 1 open bug)

Details

Attachments

(2 files)

Release-only worker:

  • tagging at the start of releases via release-early-tagging
  • tagging and version bump via release-early-tagging

Will need to review instance sizing since this needs to clone gecko.

gecko-1-tree can do early tagging OK - e.g. https://tools.taskcluster.net/groups/SpyLLQRFRCyQij-XGkio_A/tasks/L268SOv0TVmYnoSpOEk4Gg

  • Initial clone sets up the hg share and takes 23 minutes (AWS unknown, we never do this with a static host)
  • subsequent runs are 4m 40s (AWS ~ 3m 30s)
  • that's for a CPU request of 1000m, memory request 4000M
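In Kubernetes terms, a request/limit pair like the one quoted here (requests of 1000m CPU / 4000M memory, with the somewhat higher limits of 1200m / 4500M visible in the utilisation graphs) would look roughly like this in the pod spec. This is a sketch only - the container and image names are hypothetical, since the actual worker deployment manifest isn't attached to this bug:

```yaml
# Sketch: real manifest not shown in this bug; names are hypothetical.
apiVersion: v1
kind: Pod
metadata:
  name: treescript-worker        # hypothetical name
spec:
  containers:
    - name: treescript
      image: example/treescript:latest   # hypothetical image
      resources:
        requests:                # what the scheduler reserves on the node
          cpu: "1000m"
          memory: "4000M"
        limits:                  # burst ceiling; requests < limits = Burstable QoS
          cpu: "1200m"
          memory: "4500M"
```

Because the limits exceed the requests, the pod can burst above its reservation when the node has spare capacity, which matches the graphs showing usage between the two figures.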

(In reply to Nick Thomas [:nthomas] (UTC+12) from comment #1)

gecko-1-tree can do early tagging ok - eg https://tools.taskcluster.net/groups/SpyLLQRFRCyQij-XGkio_A/tasks/L268SOv0TVmYnoSpOEk4Gg

  • Initial clone sets up the hg share and takes 23 minutes (AWS unknown, we never do this with a static host)
  • subsequent runs are 4m 40s (AWS ~ 3m 30s)
  • that's for a CPU request of 1000m, memory request 4000M

Beefy instances :) It seems in AWS we're using a t2.medium for both treescriptworker1 and treescriptworker-dev1; per the AWS docs that's 2 vCPUs and 4 GB of RAM, so we'd need the same here. We had this problem with signingworkers as well, where we kept bumping the memory and CPUs until we closed the runtime gap with the AWS counterparts.

Luckily we only need one instance here, no? (In the AWS world we only had one.)
Or is it worth allocating two?

Attached image Utilisation graphs

The first early tag is at about 16:00, and includes a clone. Two reruns at ~17:00 use the hg share. You can see we have higher limits than requests - 1200m and 4500M - and that headroom does get used.

The node itself has 2 vCPUs and 7.5 GB, so we're heading toward single occupancy if we go much higher on the request. Raising limits might help a bit. Mercurial, being a Python app, will only use a single CPU, so the extra headroom would mainly absorb OS load.

Overall, maybe it's not a big deal given these are leaf tasks; slower runs would just make the ship graph a little longer.

The creds should be good on gecko-3-tree - here's a maple early tagging from a week ago:
https://tools.taskcluster.net/groups/GM-nVD0MQ2W36kyRRRh8rA/tasks/c0VigaWARJyfaV2dYmIEGw/details

Pushed by nthomas@mozilla.com:
https://hg.mozilla.org/integration/autoland/rev/d41b80604a69
Switch to new GCP treescript workers, r=mtabara
Status: ASSIGNED → RESOLVED
Closed: 27 days ago
Resolution: --- → FIXED

I don't think we need any autoscale patch to match the one worker we have in AWS.

We could consider autoscaling between 0 and 1, given we only need treescript twice per release, which works out to 10 × 5 min = 50 minutes per week for most of the beta cycle. One downside of having no instance live is that the first job will spend 20-25 minutes doing the initial clone of the hg share.
Aki and I speculated that we could pre-populate the hg share during the image build, and then only need to pull the delta since then. Some of the time won that way may be eroded by the larger image increasing compression time, as well as transfer to Docker Hub and into the k8s cluster.
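A minimal sketch of the pre-population idea, assuming a Debian base image and an `HG_SHARE_BASE_DIR`-style share layout (the real image build files aren't part of this bug, so the paths, base image, and the choice of mozilla-unified are all assumptions):

```dockerfile
# Sketch only: base image, paths, and repo choice are assumptions.
FROM debian:stable-slim
RUN apt-get update && apt-get install -y --no-install-recommends \
        mercurial ca-certificates && \
    rm -rf /var/lib/apt/lists/*
ENV HG_SHARE_BASE_DIR=/builds/hg-shared
# Seed a bare clone into the share directory at image-build time;
# --noupdate skips the working copy to keep the layer smaller.
RUN mkdir -p "$HG_SHARE_BASE_DIR" && \
    hg clone --noupdate https://hg.mozilla.org/mozilla-unified \
        "$HG_SHARE_BASE_DIR/mozilla-unified"
```

At task time the worker would then only pull changesets landed since the image was built, at the cost of the multi-gigabyte clone being baked into every image layer push and pull.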

Kubernetes also has Volume and PersistentVolume support for sharing data between containers, but we'd want to make sure the backing store is fast because hg does so much I/O (i.e. not the slow AWS EBS block store we had on workers a while back). I'm also not sure whether that's compatible with more than one container (race conditions in the share), or how the storage vs. GCE costs work out.
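For reference, a persistent volume claim for the hg share might look something like the sketch below. The name, storage class, and size are assumptions, not anything deployed; note that a `ReadWriteOnce` access mode would sidestep the multi-container race concern by limiting the volume to one node at a time:

```yaml
# Sketch: names, class, and size are hypothetical.
apiVersion: v1
kind: PersistentVolumeClaim
metadata:
  name: hg-share            # hypothetical name
spec:
  accessModes:
    - ReadWriteOnce         # mounted by one node at a time
  storageClassName: ssd     # assumed class backed by pd-ssd, for hg's I/O
  resources:
    requests:
      storage: 20Gi         # rough guess at the gecko share size
```

Whether SSD-backed persistent disk is cheaper than simply paying the clone time on a fresh pod is the storage-vs-GCE cost question raised above.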

At some point l10n-bumper will move into treescript and run jobs every hour.

The gecko change was uplifted to beta by the sheriffs:
https://hg.mozilla.org/releases/mozilla-beta/rev/d41b80604a6952fcf20d08250b60ce065223d24b

This wasn't included in 71.0b2 but will be in b3.
