Closed Bug 1525094 Opened 7 months ago Closed 4 months ago

Run Linux builds in GCP at tier 3

Categories

(Taskcluster :: Workers, enhancement)

enhancement
Not set

Tracking

(Not tracked)

RESOLVED FIXED
mozilla68

People

(Reporter: coop, Assigned: grenade)

References

(Blocks 2 open bugs)

Details

Attachments

(1 file)

We need to start moving some workloads to GCP so we can appraise performance.

We've done many migrations now, and the easiest place to start is usually Linux builds, so let's try setting up Linux64 opt builds in GCP and go from there. We can run those builds through the existing suite of testing in AWS and have pretty high confidence that the new builds are correct.

Things to be aware of:

  • because we haven't managed to consolidate our worker implementations yet, we'll be using docker-worker for these builds because that's what we're still using in AWS. This will minimize the moving parts and allow for better comparisons.

  • what is the image delta between AWS AMIs and custom images in GCP?
    ** can we generate them in the same way, or even at the same time to prevent skew?

  • what are the performance characteristics of GCP instances vs the comparable setup in AWS?

  • does artifact download time increase substantially in GCP vs AWS? We may want to investigate caching in GCP sooner if we're taking a big hit.

Wander: are you going to be able to start on this bug this week?

Flags: needinfo?(wcosta)

(In reply to Chris Cooper [:coop] pronoun: he from comment #1)

> Wander: are you going to be able to start on this bug this week?

Yes, I just need the GCP project those builds will be running in.

Flags: needinfo?(wcosta)

I won't ask you to create multiple projects until we have a plan, but I think for the first one, it doesn't matter because we can always move/rename/recreate it later.

I would name it "Linux64" and if we decide to do something more clever at the meeting on Wednesday, we can revisit.

I have created 5 instances in GCE. They run docker-worker from the gcp branch [1].
The provisionerId is "gce" and the worker-type is "opt-linux64". Each instance has capacity 1. I am currently running a build on each one [2].

[1] https://github.com/taskcluster/docker-worker/tree/gcp
[2] https://tools.taskcluster.net/groups/EI0S2uGdQDiQRWqtzfufww

Component: Docker-Worker → Workers

Instances are running, but under a different worker type: gce/gecko-3-b-linux

Status: ASSIGNED → RESOLVED
Closed: 6 months ago
Resolution: --- → FIXED

The important part here that is still outstanding is having the builds running per push on mozilla-central at tier 3. We shouldn't do this until :bc lets us know that the test exception/failure pattern matches that of builds generated in AWS.

bc: is there a bug for greening up the tests, or some other data source where you're tracking the results?

Status: RESOLVED → REOPENED
Flags: needinfo?(bob)
Resolution: FIXED → ---

No bug as of yet. I got the builds to run and got some tests to run but the limited number of gce workers has prevented me from getting results. I was just talking with wcosta and he is bumping the number of workers from 5 up to something. I'll file one once I can get a try push working.

wcosta was under the impression we were only doing builds on gce and not running tests there. I was under the impression I was supposed to evaluate and green up tests which were both built and run on gce. Can you clarify?

Flags: needinfo?(bob)

(In reply to Bob Clary [:bc:] from comment #7)

> No bug as of yet. I got the builds to run and got some tests to run but the limited number of gce workers has prevented me from getting results. I was just talking with wcosta and he is bumping the number of workers from 5 up to something. I'll file one once I can get a try push working.
>
> wcosta was under the impression we were only doing builds on gce and not running tests there. I was under the impression I was supposed to evaluate and green up tests which were both built and run on gce. Can you clarify?

We don't have test workers set up yet, only build workers.

What we're trying to discern right now is whether a Linux64 opt build built in GCE is equivalent to one built in AWS. We're triggering all the same tests in AWS on both so that we can figure that out. Once we're sure the builds themselves are OK, Wander can get workers set up in GCE for running tests.

bc: what I'm looking for from you is that comparison piece, i.e. wpt4 fails in both GCE and AWS, wpt8 is intermittent in GCE but solid in AWS, etc. Does that make sense?
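
The comparison piece described above could be sketched roughly like this. It is only an illustration: the outcome labels ("pass", "fail", "intermittent") and the dictionary layout are assumptions, not the format of any real Treeherder export.

```python
def compare_results(aws, gce):
    """Group test suites by how their outcome differs between builds.

    Both arguments map a suite name (e.g. "wpt4") to an outcome string
    for tests run against the AWS-built and GCE-built binaries.
    The outcome vocabulary is hypothetical.
    """
    report = {"same": [], "different": [], "missing": []}
    for suite in sorted(set(aws) | set(gce)):
        if suite not in aws or suite not in gce:
            # Suite ran against only one of the two builds.
            report["missing"].append(suite)
        elif aws[suite] == gce[suite]:
            report["same"].append((suite, aws[suite]))
        else:
            report["different"].append((suite, aws[suite], gce[suite]))
    return report


# Made-up sample data mirroring the wpt4/wpt8 example in the comment.
aws = {"wpt4": "fail", "wpt8": "pass"}
gce = {"wpt4": "fail", "wpt8": "intermittent"}
report = compare_results(aws, gce)
print(report["same"])
print(report["different"])
```

Only the "different" bucket would need investigation before trusting the GCE builds; anything in "same" fails or passes regardless of where the build was made.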

Flags: needinfo?(bob)

Yes, that clarifies things greatly. Thanks. I'll have to rethink how to do that.

Flags: needinfo?(bob)
See Also: → 1529951

Given the results in bug 1529951, I've scaled down the Instance Groups for the std8 and std16 groups from 10 to 0. We can probably delete them altogether soon.

Last thing to do here is to get these builds running at tier 3 on m-c against the std32 GCE instances.

No longer blocks: 1536559

(In reply to Chris Cooper [:coop] pronoun: he from comment #10)

> Given the results in bug 1529951, I've scaled down the Instance Groups for the std8 and std16 groups from 10 to 0. We can probably delete them altogether soon.
>
> Last thing to do here is to get these builds running at tier 3 on m-c against the std32 GCE instances.

Rob has had some success with the n1-highcpu-32 instances for Windows: https://bugzilla.mozilla.org/show_bug.cgi?id=1531378#c19 Since they are cheaper than the standard variant, I've recreated our Linux GCE pool using highcpu instances and am going to retrigger some plain Linux builds to see what the delta is.

(In reply to Chris Cooper [:coop] pronoun: he from comment #11)

> Rob has had some success with the n1-highcpu-32 instances for Windows: https://bugzilla.mozilla.org/show_bug.cgi?id=1531378#c19 Since they are cheaper than the standard variant, I've recreated our Linux GCE pool using highcpu instances and am going to retrigger some plain Linux builds to see what the delta is.

https://treeherder.mozilla.org/#/jobs?repo=try&tier=1%2C2%2C3&searchStr=plain&selectedJob=236181184&revision=0c1a53c5a11e556b4d1f5ce1debf0a63928462cf

(In reply to Chris Cooper [:coop] pronoun: he from comment #12)

> https://treeherder.mozilla.org/#/jobs?repo=try&tier=1%2C2%2C3&searchStr=plain&selectedJob=236181184&revision=0c1a53c5a11e556b4d1f5ce1debf0a63928462cf

There's some scatter, but builds on the highcpu instances take on average about 3 minutes (170s) longer than on standard, or roughly 11-12% of the overall build time (about 23 minutes on standard, 26 minutes on highcpu).

n1-standard-32: 1380s avg build time
n1-highcpu-32: 1550s avg build time

This indicates to me that we are only occasionally memory-bound. The jump in memory from the highcpu to the standard is substantial: 28.8GB -> 120GB. We probably don't need all that extra. Indeed, the AWS instances we use only have 32GB of memory.

I think I can toggle the memory directly in GCE. I'll try that tomorrow. However, given that we're already doubling the vCPUs vs AWS to get comparable performance, there are diminishing returns here.
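
For anyone who wants to check the delta quoted above, it follows directly from the two averages. A quick sketch, using only the numbers already given in this comment:

```python
standard_s = 1380  # n1-standard-32 average build time in seconds
highcpu_s = 1550   # n1-highcpu-32 average build time in seconds

delta_s = highcpu_s - standard_s

# The percentage depends on which build time is taken as the baseline.
print(f"delta: {delta_s}s (~{delta_s / 60:.1f} min)")
print(f"relative to standard: {delta_s / standard_s:.1%}")
print(f"relative to highcpu:  {delta_s / highcpu_s:.1%}")
```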

We might as well enable all the Linux builds at tier 3 here, not just 64-bit opt.

Summary: Run Linux64 opt builds in GCP at tier 3 → Run Linux builds in GCP at tier 3

this change adds support for parallel gcp builds for the following linux build configurations:

  • linux(32)
    • opt
    • debug
    • shippable
  • linux64
    • opt
    • debug
    • shippable

implementation notes:

  • this patch mostly mirrors the equivalent windows-on-gcp patch at: https://phabricator.services.mozilla.com/D24865
  • gcp builds are triggered with a treeherder tier 3 flag so that they are only displayed in the treeherder ui when the user has a tier 3 flag set.
  • gcp builds use a th build symbol of "Bg" to make them easy to differentiate from ec2 builds in the treeherder ui.
  • gcp builds use a perfherder "gcp" flag to make them easier to differentiate from ec2 builds in the perfherder ui.
  • gcp builds on linux for all scm levels are built on the only available gcp linux worker type (at the time of this change): gce/gecko-1-b-linux-32
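
as a rough illustration of the notes above (not the actual patch), deriving a gcp variant from an ec2 build job might look something like the sketch below. every key name here ("label", "treeherder", "worker-type", "perfherder-flags") and the sample job are hypothetical stand-ins, not the real taskgraph schema; only the tier, symbol, flag and worker type values come from the notes.

```python
import copy


def gcp_variant(job):
    """Return a tier-3 GCP copy of an EC2 build job description.

    The dictionary layout is hypothetical; see the patch itself for
    the real configuration.
    """
    variant = copy.deepcopy(job)  # leave the EC2 job untouched
    variant["label"] = job["label"] + "-gcp"
    variant["treeherder"]["tier"] = 3       # hidden unless tier 3 is enabled in the UI
    variant["treeherder"]["symbol"] = "Bg"  # distinguishes it from EC2 builds
    variant["perfherder-flags"] = ["gcp"]
    # Only GCP Linux worker type available at the time of this change.
    variant["worker-type"] = "gce/gecko-1-b-linux-32"
    return variant


# Hypothetical EC2 job, for illustration only.
ec2_job = {
    "label": "build-linux64/opt",
    "treeherder": {"tier": 1, "symbol": "B"},
    "worker-type": "ec2-worker-type-placeholder",
}
gcp_job = gcp_variant(ec2_job)
print(gcp_job["label"], gcp_job["treeherder"])
```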

if someone with write access to the gcloud project hosting the gce/gecko-1-b-linux-32 workers (maybe "linux64-builds"?) could create a livelog firewall exception, that'd be great. it should be possible with a command similar to:

gcloud compute firewall-rules create livelog-direct --allow tcp:60023 --description "allows connections to livelog GET interface, running on taskcluster worker instances"
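
once a rule like that exists, a quick way to confirm the port is reachable from outside might be a plain tcp connect, sketched below. this only checks that something accepts connections on the port; it does not speak the livelog protocol, and the hostname in the usage note is a made-up placeholder.

```python
import socket

LIVELOG_PORT = 60023  # port opened by the firewall rule above


def port_reachable(host, port=LIVELOG_PORT, timeout=3.0):
    """Return True if a TCP connection to host:port succeeds within timeout."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        # Covers refused connections, timeouts and DNS failures.
        return False
```

usage would be something like `port_reachable("worker.example.com")`, run from a machine outside the gcp project while a task is running on the worker.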

wcosta: i've just noticed a reference to gce/gecko-3-b-linux in comment 5 above. are there also gce/gecko-2-b-linux-32 & gce/gecko-3-b-linux-32 worker types available?

if so, i'll update the patch to make use of them...

also, for the windows patch in bug 1536555, we needed to set up some temporary provisioning to ensure we have enough windows instances to handle the gcp load.

from the testing i did, it looks like there are only 10 running instances of gce/gecko-1-b-linux-32.

from what we learned from the windows patch, we'll probably need closer to 40 running instances to handle normal US daytime load. do we have a (temporary) provisioning strategy to cope with this?

Flags: needinfo?(wcosta)
Attachment #9056509 - Flags: review?(wcosta)
Depends on: 1542997
Depends on: 1543164
Flags: needinfo?(wcosta)
Attachment #9056509 - Flags: review?(wcosta) → review+
Assignee: wcosta → rthijssen
Status: REOPENED → ASSIGNED
Keywords: checkin-needed

Pushed by aiakab@mozilla.com:
https://hg.mozilla.org/integration/autoland/rev/679d694324d4
run linux builds in gcp at tier 3 r=wcosta

Keywords: checkin-needed
Status: ASSIGNED → RESOLVED
Closed: 6 months ago → 4 months ago
Resolution: --- → FIXED
Target Milestone: --- → mozilla68
Blocks: 1546414