Closed Bug 1500274 Opened Last year Closed 8 months ago

performance regression in Windows build times, central and beta builds often timing out

Categories

(Firefox Build System :: Toolchains, defect, critical)

Tracking

(firefox65 fixed, firefox66 fixed)

RESOLVED WORKSFORME

People

(Reporter: aryx, Unassigned)

Details

(Keywords: regression)

Attachments

(1 file)

Pushed by nthomas@mozilla.com:
https://hg.mozilla.org/mozilla-central/rev/fdd2d783dd2e
increase timeouts for Windows nightly builds, r=RyanVM (irc), a=RyanVM
FTR, Win PGO builds already have a 3hr timeout, so I think it's reasonable to do the same for Nightly builds.

More observations - jobs on c5.4xlarge instances run successfully in around 100min. Jobs on c4.4xlarge instances are the ones timing out. Win PGO builds (with the aforementioned 3hr timeouts) are finishing on c4.4xlarge instances in ~125min, so we're *just* barely going over the limit here.
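
The headroom described above can be checked with a little arithmetic. This is only an illustrative sketch: the durations are the approximate figures from this comment, the 120-minute old limit is the one quoted elsewhere in this bug, and the names `headroom` and `report` are made up for the example.

```python
# Approximate durations in minutes, as reported in this comment.
OBSERVED_MINUTES = {
    "c5.4xlarge nightly": 100,
    "c4.4xlarge PGO": 125,
}

# Assumption: the old limit is the 120-minute mark quoted elsewhere in this bug;
# the new limit is the 3hr timeout that Win PGO builds already had.
LIMITS_MINUTES = {"old (2h)": 120, "new (3h)": 180}


def headroom(duration_minutes, limit_minutes):
    """Minutes left before the timeout; a negative value means a timeout."""
    return limit_minutes - duration_minutes


def report():
    """Map (job, limit name) to the remaining headroom in minutes."""
    return {
        (job, limit_name): headroom(duration, limit)
        for job, duration in OBSERVED_MINUTES.items()
        for limit_name, limit in LIMITS_MINUTES.items()
    }


if __name__ == "__main__":
    for (job, limit_name), minutes in sorted(report().items()):
        print(f"{job:20s} {limit_name:10s} headroom: {minutes:+d} min")
```

Under these numbers only the c4.4xlarge case goes negative against the old limit, which matches the "just barely going over" observation.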
There are multiple things going wrong here, and increasing the timeouts is only hiding the problem.

One is that the msvc builds on beta are not using the msvc mozconfigs, so they are effectively clang-cl builds.

Another is the regression in build times. Judging by the corresponding alert (https://treeherder.mozilla.org/perf.html#/alerts?id=16907), there might be something fishy going on with sccache, but there also appears to be a legit build time regression from bug 1486554.
(In reply to Mike Hommey [:glandium] from comment #4)
> One is that the msvc builds on beta are not using the msvc mozconfigs, so
> they are effectively clang-cl builds.

Filed bug 1500290.
Flags: needinfo?(mh+mozilla)
(In reply to Mike Hommey [:glandium] from comment #4)
> it looks like there might be something fishy going on with sccache

Filed bug 1500295.
There is a consistent ~12-13% regression in build times from using the plugin on Windows. I measured what it looks like on linux64, and it's on the order of ~9-10% there, so Windows is only slightly worse. That is not entirely surprising: on Windows, the plugin contains all the LLVM and clang code it uses, effectively duplicating what's in the clang binary. That's not the case on Linux.

So short of fixing sccache on Windows (bug 1476604), there's not much we can do in the short term besides increasing the timeouts.
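
To illustrate why a slowdown of this size tips borderline builds over the limit, here is a rough sketch. The 120-minute limit is the one quoted elsewhere in this bug, the regression fractions are the ~12-13% measured above, and the function names are hypothetical, not part of any build tooling.

```python
def slowed(baseline_minutes, regression):
    """Build duration after applying a fractional slowdown (0.13 means 13%)."""
    return baseline_minutes * (1 + regression)


def max_safe_baseline(limit_minutes, regression):
    """Longest pre-regression duration that still fits under the limit."""
    return limit_minutes / (1 + regression)


if __name__ == "__main__":
    limit = 120  # assumption: the 2h limit quoted elsewhere in this bug
    for regression in (0.12, 0.13):
        safe = max_safe_baseline(limit, regression)
        print(f"{regression:.0%} slowdown: builds over {safe:.1f} min now time out")
```

In other words, any build that previously took more than roughly 106 to 107 minutes would be expected to exceed a 120-minute limit once the slowdown is applied.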

We may want to look more closely at whether there are some avoidable performance pitfalls in the plugin implementation separately (which would improve the performance on both Linux and Windows). I'll gather some more data and file a separate bug for that.

Please close this bug if the timeout increases addressed the issue on all the builds that were affected.
The opt builds (B) on beta were also hitting timeouts. But separately, bstack removed c4.4xlarge from the allowed instance types for gecko-3-b-win2012 (https://tools.taskcluster.net/aws-provisioner/gecko-3-b-win2012/view). I'm not sure if the intention is to do that permanently, but it would solve the timeout issue.
The opt builds on beta are PGO, their timeout was probably borderline before.
regarding the removal of c4.4xlarge worker types from the allowed instance types mentioned in comment 8:

we're seeing difficulty obtaining enough c5 instances to keep pending times normal. eg: https://tools.taskcluster.net/aws-provisioner/gecko-3-b-win2012/health shows that many spot requests are rejected by ec2 with the message: InsufficientInstanceCapacity

if we need builds to occur on c5 only, we'll need to increase the number of regions we support building in (set up sccache and hg buckets and acls) so that it's possible to obtain the number of instances we require. currently we can build in:
eu-central-1
us-east-1
us-east-2
us-west-1
us-west-2

maybe we should look at adding sccache and hg s3 buckets and enabling builds in:
ap-south-1
ap-northeast-1
ap-northeast-2
ap-southeast-1
ap-southeast-2
ca-central-1
eu-west-1
eu-west-2
eu-west-3
sa-east-1
worker type gecko-3-b-win2012 was configured to only provision instances in us-east-1, us-west-1 & us-west-2.
within those three regions we only have 38 instances running and ~500 pending jobs.

i have added regions us-east-2 & eu-central-1 to the gecko-3-b-win2012 configuration (this now matches the gecko-1-b-win2012 configuration) in an effort to get more c5.4xlarge instances up and the queue reduced.

i know that in the past we have disabled us-east-2 & eu-central-1 because of hg issues (slow clone times), so if we see problems relating to that again or other region issues (maybe sccache), we'll need to remove those regions again and rethink what we're going to do.

the configuration that i added regions to is here (in case we need to revert my changes):
https://tools.taskcluster.net/aws-provisioner/gecko-3-b-win2012/edit
i have added c4 instance types back to the gecko-3-b-win2012 worker type definition in an effort to get some instances running. the extra regions did not give us any new instances: after an hour or so of waiting there are still only 38 instances, and the queue is growing despite the trees being closed.

also there are errors like this in papertrail:
"ERROR aws-provisioner-production: error provisioning this worker type, skipping (workerType=gecko-3-b-win2012, err={})"
https://papertrailapp.com/systems/taskcluster-aws-provisioner2/events?focus=989867078829891597

the most recent health message at: https://tools.taskcluster.net/aws-provisioner/gecko-3-b-win2012/health is more than 16 hours old.

i'm not sure why nothing has been provisioned in more than 15 hours and have been unable to get a response to !t-rex on #taskcluster.
i also removed the new regions (use2,euc1) for gecko-3-b-win2012.
there are 600 instances up now. all c4 except the 38 already running c5 instances.
No longer depends on: 1500361
Win2012 opt B & Bmsvc started to time out again after this merge to beta: https://hg.mozilla.org/releases/mozilla-beta/pushloghtml?changeset=01378c910610cd214b2838650d0d2b7218fa8b5d

TH link: https://treeherder.mozilla.org/#/jobs?repo=mozilla-beta&resultStatus=success%2Cbusted&searchStr=win%2C2012%2Copt&tochange=01378c910610cd214b2838650d0d2b7218fa8b5d&fromchange=8efe26839243319464f00a472363e392de27cd4a&selectedJob=206720956

When this landed on beta and central, the builds were green under the 120-minute mark (e.g. beta http://tinyurl.com/y8t6jf3b - 100 mins, central http://tinyurl.com/ycst9ejc - 56 mins), so I'm not sure that really addressed the issue, because two days later they still timed out after 120 mins. http://tinyurl.com/yb37ezoe
Build times on Beta and Central aren't directly comparable because Beta has PGO enabled by default and m-c doesn't (you'd need to compare Beta vs. m-c PGO to get a more fair comparison). Anyway, yes, the underlying issue is still there - PGO builds on c4 instances are prone to timing out and ones on c5 instances aren't, which is basically luck of the draw.

That said, I'll bump the build timeout on Beta64 to at least cut down on the noise for now until we can come up with something more robust. I'm not going to touch the MSVC builds, though, since those should be fixed once bug 1500290 is sorted out (shudder).
Bump the timeout on Beta for Windows opt builds to match other PGO-enabled builds:
https://hg.mozilla.org/releases/mozilla-beta/rev/aba25ceab10f
I wonder if it's worthwhile to configure a separate worker type that will only provide c5.4xlarge instances, and change our task definitions so that PGO builds use that worker type, where other builds can use the standard gecko-3-b-win2012? If we have trouble maintaining capacity without m4.4xlarge instances in the mix it doesn't seem sensible to force all our builds onto the larger instances.
Attached file GitHub Pull Request
this pr adds worker types gecko-3-b-win2012-c4 & gecko-3-b-win2012-c5 to the occ ci ami builds. note that this will only create the extra worker types. once we verify that they work as expected, we can look at changing builds to make use of them (or not, depending on what's decided about how to move forward)
Attachment #9019137 - Flags: review?(mcornmesser)
Attachment #9019137 - Flags: review?(mcornmesser) → review+
First beta merge is next Monday and the Windows opt build times will have to be increased again from 2h to 3h. Does anything else need to be done?
Flags: needinfo?(rthijssen)
my suggestion would be to explicitly use worker type: gecko-3-b-win2012-c5 (instead of gecko-3-b-win2012)

> - gecko-3-b-win2012-c4 always uses the slower c4.4xlarge ec2 instance types.
> - gecko-3-b-win2012-c5 always uses the faster c5.4xlarge ec2 instance types.
> - gecko-3-b-win2012 is a lottery and may use either c4.4xlarge or c5.4xlarge ec2 instance types depending on what the provisioner asks for and what ec2 has available at the time of the spot request.

in-tree changes will be needed to specify that gecko-3-b-win2012-c5 should be used for those builds, but what changes are needed is beyond my expertise.
Flags: needinfo?(rthijssen)
Do we know if there is more c5.4xlarge spot-capacity now than at the time of comment #13 ? The reduced build times with that family are great but it'd be a worry if the builds are pinned to a scarce resource.
Can we force only release branches onto the c5 workers and leave trunk for whatever's available (c5 preferred)?
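
The policy suggested here could look roughly like the sketch below. It is purely illustrative: the worker type names come from this bug, but `choose_worker_type` and the set of projects treated as release branches are assumptions, and this is not how the in-tree task definitions are actually written.

```python
# Worker type names as described in this bug.
C5_ONLY = "gecko-3-b-win2012-c5"  # always c5.4xlarge
MIXED = "gecko-3-b-win2012"       # c4.4xlarge or c5.4xlarge, whichever is available

# Assumption: which projects count as release branches for this policy.
RELEASE_PROJECTS = frozenset({"mozilla-beta", "mozilla-release"})


def choose_worker_type(project):
    """Pin release-branch builds to c5-only workers; leave trunk on the mixed pool."""
    return C5_ONLY if project in RELEASE_PROJECTS else MIXED


if __name__ == "__main__":
    for project in ("mozilla-central", "mozilla-beta"):
        print(project, "->", choose_worker_type(project))
```

The "c5 preferred" part for trunk is not something this sketch can express; that would depend on how the provisioner weighs instance types within the mixed worker type.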
Duplicate of this bug: 1514639

Build times are still close to the timeout, so I have applied the timeout increase once more: https://hg.mozilla.org/releases/mozilla-beta/rev/31f926caf2452050575ab84e54e6cdfbf6b2cbfc

Closing this for now; it hasn't failed since Gecko 67 landed on beta 3 days ago. https://treeherder.mozilla.org/logviewer.html#?job_id=233898544&repo=mozilla-beta ran for 5 minutes less than the limit allowed.

Status: NEW → RESOLVED
Closed: 8 months ago
Resolution: --- → WORKSFORME
Duplicate of this bug: 1572339