Closed
Bug 1500274
Opened 6 years ago
Closed 6 years ago
performance regression in Windows build times, central and beta builds often timing out
Categories
(Firefox Build System :: Toolchains, defect)
Tracking
(firefox65 fixed, firefox66 fixed)
RESOLVED
WORKSFORME
People
(Reporter: aryx, Unassigned)
References
Details
(Keywords: regression)
Attachments
(1 file)
Today we have seen Windows build time regressions:
https://treeherder.mozilla.org/perf.html#/graphs?series=mozilla-central,1460548,1,2&series=autoland,1460334,1,2&series=mozilla-inbound,1460234,1,2&zoom=1539791467530.0327,1539897557000,2000.0000000000002,4000&selected=autoland,1460334,392432,609977213,2
The changelog of the first build with the performance regression contained bug 1486554, but the MSVC builds (Bmsvc) also seem to be affected. This causes frequent build timeouts on central and beta. See e.g. this push on beta:
https://treeherder.mozilla.org/#/jobs?repo=mozilla-beta&resultStatus=success%2Cusercancel%2Crunnable%2Ctestfailed%2Cbusted%2Cexception%2Cretry&group_state=expanded&revision=8efe26839243319464f00a472363e392de27cd4a&searchStr=windows%2Cbuild&selectedJob=206426657
Mike, can you take a look at this, please? Thank you in advance.
Flags: needinfo?(mh+mozilla)
Pushed by nthomas@mozilla.com:
https://hg.mozilla.org/mozilla-central/rev/fdd2d783dd2e
increase timeouts for Windows nightly builds, r=RyanVM (irc), a=RyanVM
Comment 2•6 years ago
Workaround - bumping the timeout for nightly builds from 2 hours to 3:
https://hg.mozilla.org/mozilla-central/rev/fdd2d783dd2e354ab9dae7d04912cd6b937ba9b3
https://hg.mozilla.org/releases/mozilla-beta/rev/c55eb5a6638b0dc7d73dc028cccdf2c3c68f6bea
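For reference, the change is a one-line bump in the in-tree task configuration. A minimal sketch, assuming the Windows nightly build definitions live in taskcluster/ci/build/windows.yml and use the usual worker.max-run-time field (in seconds); the job name below is illustrative:

    win64-nightly/opt:
        worker:
            # was 7200 (2 hours); 3 hours matches the existing Win PGO timeout
            max-run-time: 10800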
Updated•6 years ago
Keywords: leave-open
Comment 3•6 years ago
FTR, Win PGO builds already have a 3hr timeout, so I think it's reasonable to do the same for Nightly builds.
More observations - jobs on c5.4xlarge instances run successfully in around 100min. Jobs on c4.4xlarge instances are the ones timing out. Win PGO builds (with the aforementioned 3hr timeouts) are finishing on c4.4xlarge instances in ~125min, so we're *just* barely going over the limit here.
Comment 4•6 years ago
There are multiple things going wrong here, and increasing the timeouts is only hiding the problem.
One is that the msvc builds on beta are not using the msvc mozconfigs, so they are effectively clang-cl builds.
Another is the regression in build times, and seeing the corresponding alert (https://treeherder.mozilla.org/perf.html#/alerts?id=16907) it looks like there might be something fishy going on with sccache, but it also looks like there is a legit build time regression from bug 1486554.
Comment 5•6 years ago
(In reply to Mike Hommey [:glandium] from comment #4)
> One is that the msvc builds on beta are not using the msvc mozconfigs, so
> they are effectively clang-cl builds.
Filed bug 1500290.
Flags: needinfo?(mh+mozilla)
Comment 6•6 years ago
(In reply to Mike Hommey [:glandium] from comment #4)
> it looks like there might be something fishy going on with sccache
Filed bug 1500295.
Comment 7•6 years ago
There is a consistent ~12-13% regression in build times from using the plugin on Windows. I measured what it looks like on linux64, and there it's on the order of ~9-10%, so Windows is only slightly worse. That's not entirely surprising: on Windows, the plugin contains all the LLVM and clang code it uses, effectively duplicating what's in the clang binary, which is not the case on Linux.
So short of fixing sccache on Windows (bug 1476604), there's not much we can do in the short term besides increasing the timeouts.
We may want to look more closely whether there are some avoidable performance pitfalls in the plugin implementation separately (which would improve the performance on both Linux and Windows). I'll gather some more data and file a separate bug for that.
Please close this bug if the timeout increases addressed the issue on all the builds that were affected.
Comment 8•6 years ago
The opt builds (B) on beta were also hitting timeouts. Separately, bstack removed c4.4xlarge from the allowed instance types for gecko-3-b-win2012 (https://tools.taskcluster.net/aws-provisioner/gecko-3-b-win2012/view). I'm not sure if the intention is to do that permanently, but it would solve the timeout issue.
Comment 9•6 years ago
The opt builds on beta are PGO, their timeout was probably borderline before.
Comment hidden (Intermittent Failures Robot)
Comment 11•6 years ago
regarding the removal of c4.4xlarge worker types from the allowed instance types mentioned in comment 8:
we're seeing difficulty obtaining enough c5 instances to keep pending times normal. eg: https://tools.taskcluster.net/aws-provisioner/gecko-3-b-win2012/health shows that many spot requests are rejected by ec2 with the message: InsufficientInstanceCapacity
if we need builds to occur on c5 only, we'll need to increase the number of regions we support building in (set up sccache and hg buckets and acls) so that it's possible to obtain the number of instances we require. currently we can build in:
eu-central-1
us-east-1
us-east-2
us-west-1
us-west-2
maybe we should look at adding sccache and hg s3 buckets and enabling builds in:
ap-south-1
ap-northeast-1
ap-northeast-2
ap-southeast-1
ap-southeast-2
ca-central-1
eu-west-1
eu-west-2
eu-west-3
sa-east-1
Comment 12•6 years ago
worker type gecko-3-b-win2012 was configured to only provision instances in us-east-1, us-west-1 & us-west-2.
within those three regions we only have 38 instances running and ~500 pending jobs.
i have added regions us-east-2 & eu-central-1 to the gecko-3-b-win2012 configuration (this now matches the gecko-1-b-win2012 configuration) in an effort to get more c5.4xlarge instances up and the queue reduced.
i know that in the past we have disabled us-east-2 & eu-central-1 because of hg issues (slow clone times), so if we see problems relating to that again or other region issues (maybe sccache), we'll need to remove those regions again and rethink what we're going to do.
the configuration that i added regions to is here (in case we need to revert my changes):
https://tools.taskcluster.net/aws-provisioner/gecko-3-b-win2012/edit
Comment 13•6 years ago
i have added c4 instance types back to the gecko-3-b-win2012 worker type definition in an effort to get some instances running. the extra regions did not give us any new instances; after an hour or so of waiting there are still only 38 instances, and the queue is growing despite the trees being closed.
also there are errors like this in papertrail:
"ERROR aws-provisioner-production: error provisioning this worker type, skipping (workerType=gecko-3-b-win2012, err={})"
https://papertrailapp.com/systems/taskcluster-aws-provisioner2/events?focus=989867078829891597
the most recent health message at: https://tools.taskcluster.net/aws-provisioner/gecko-3-b-win2012/health is more than 16 hours old.
i'm not sure why nothing has been provisioned in more than 15 hours and have been unable to get a response to !t-rex on #taskcluster.
Comment 14•6 years ago
i also removed the new regions (use2,euc1) for gecko-3-b-win2012.
Comment 15•6 years ago
there are 600 instances up now. all c4 except the 38 already running c5 instances.
No longer depends on: 1500361
Comment 16•6 years ago
Win2012 opt B & Bmsvc started to timeout again after this merge to beta: https://hg.mozilla.org/releases/mozilla-beta/pushloghtml?changeset=01378c910610cd214b2838650d0d2b7218fa8b5d
TH link: https://treeherder.mozilla.org/#/jobs?repo=mozilla-beta&resultStatus=success%2Cbusted&searchStr=win%2C2012%2Copt&tochange=01378c910610cd214b2838650d0d2b7218fa8b5d&fromchange=8efe26839243319464f00a472363e392de27cd4a&selectedJob=206720956
When this landed on beta and central, the builds were green under the 120min mark (e.g. beta http://tinyurl.com/y8t6jf3b - 100mins, central http://tinyurl.com/ycst9ejc - 56 mins), so I'm not sure that really addressed the issue, because two days later they still timed out after 120mins: http://tinyurl.com/yb37ezoe
Comment hidden (Intermittent Failures Robot)
Comment 18•6 years ago
Build times on Beta and Central aren't directly comparable because Beta has PGO enabled by default and m-c doesn't (you'd need to compare Beta vs. m-c PGO to get a more fair comparison). Anyway, yes, the underlying issue is still there - PGO builds on c4 instances are prone to timing out and ones on c5 instances aren't, which is basically luck of the draw.
That said, I'll bump the build timeout on Beta64 to at least cut down on the noise for now until we can come up with something more robust. I'm not going to touch the MSVC builds, though, since those should be fixed once bug 1500290 is sorted out (shudder).
Comment 19•6 years ago
Bump the timeout on Beta for Windows opt builds to match other PGO-enabled builds:
https://hg.mozilla.org/releases/mozilla-beta/rev/aba25ceab10f
Comment 20•6 years ago
I wonder if it's worthwhile to configure a separate worker type that will only provide c5.4xlarge instances, and change our task definitions so that PGO builds use that worker type, while other builds can use the standard gecko-3-b-win2012? If we have trouble maintaining capacity without c4.4xlarge instances in the mix, it doesn't seem sensible to force all our builds onto the larger instances.
Comment 21•6 years ago
this pr adds worker types gecko-3-b-win2012-c4 & gecko-3-b-win2012-c5 to the occ ci ami builds. note that this will only create the extra worker types. once we verify that they work as expected, we can look at changing builds to make use of them (or not, depending on what's decided about how to move forward)
Attachment #9019137 - Flags: review?(mcornmesser)
Updated•6 years ago
Attachment #9019137 - Flags: review?(mcornmesser) → review+
Comment hidden (Intermittent Failures Robot)
Comment 23•6 years ago
First beta merge is next Monday and the Windows opt build timeout will have to be increased again from 2h to 3h. Does anything else need to be done?
Flags: needinfo?(rthijssen)
Comment 24•6 years ago
my suggestion would be to explicitly use worker type: gecko-3-b-win2012-c5 (instead of gecko-3-b-win2012)
> - gecko-3-b-win2012-c4 always uses the slower c4.4xlarge ec2 instance types.
> - gecko-3-b-win2012-c5 always uses the faster c5.4xlarge ec2 instance types.
> - gecko-3-b-win2012 is a lottery and may use either c4.4xlarge or c5.4xlarge ec2 instance types depending on what the provisioner asks for and what ec2 has available at the time of the spot request.
in-tree changes will be needed to specify that gecko-3-b-win2012-c5 should be used for those builds, but what changes are needed is beyond my expertise.
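A minimal sketch of what such an in-tree change might look like, assuming the Windows build jobs in taskcluster/ci/build/windows.yml select their pool via the worker-type field (the file path and job name are assumptions):

    win64/pgo:
        # pin PGO builds to the c5-only pool instead of the mixed c4/c5 pool
        worker-type: aws-provisioner-v1/gecko-{level}-b-win2012-c5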
Flags: needinfo?(rthijssen)
Comment 25•6 years ago
Do we know if there is more c5.4xlarge spot capacity now than at the time of comment #13? The reduced build times with that family are great, but it'd be a worry if the builds are pinned to a scarce resource.
Comment 26•6 years ago
Can we force only release branches onto the c5 workers and leave trunk for whatever's available (c5 preferred)?
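Assuming worker-type can be keyed by project here (the by-project mechanism is used elsewhere in the task configuration, but whether this field supports it for the build kind is an assumption), that could look roughly like:

    worker-type:
        by-project:
            # release branches always get the faster c5-only pool
            mozilla-beta: aws-provisioner-v1/gecko-{level}-b-win2012-c5
            mozilla-release: aws-provisioner-v1/gecko-{level}-b-win2012-c5
            # trunk keeps the mixed pool (c5 preferred when available)
            default: aws-provisioner-v1/gecko-{level}-b-win2012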
Comment hidden (Intermittent Failures Robot)
Comment 28•6 years ago
bugherder uplift
status-firefox65: --- → fixed
Comment 30•6 years ago
No longer seeing this occur on latest beta-sim:
Last occurrence was on the 4th of January: https://treeherder.mozilla.org/#/jobs?repo=try&resultStatus=testfailed%2Cbusted%2Cexception%2Cretry%2Cusercancel%2Crunnable&revision=56edc08ffce4b019a75ba90f55b87c55f9d2bead&selectedJob=219953604
Comment 31•6 years ago
uplift
Build times are still close to the timeout; I have applied the timeout increase once more: https://hg.mozilla.org/releases/mozilla-beta/rev/31f926caf2452050575ab84e54e6cdfbf6b2cbfc
Updated•6 years ago
status-firefox66: --- → fixed
Comment hidden (Intermittent Failures Robot)
Comment 33•6 years ago
Closing this for now; it hasn't failed since Gecko 67 landed on beta 3 days ago. https://treeherder.mozilla.org/logviewer.html#?job_id=233898544&repo=mozilla-beta ran for 5 minutes less than the allowed limit.
Status: NEW → RESOLVED
Closed: 6 years ago
Resolution: --- → WORKSFORME
Updated•6 years ago
Keywords: leave-open
Updated•6 years ago
Keywords: regression