1500274 - performance regression in Windows build times, central and beta builds often timing out

FTR, Win PGO builds already have a 3hr timeout, so I think it's reasonable to do the same for Nightly builds. More observations: jobs on c5.4xlarge instances run successfully in around 100min. Jobs on c4.4xlarge instances are the ones timing out. Win PGO builds (with the aforementioned 3hr timeouts) are finishing on c4.4xlarge instances in ~125min, so we're *just* barely going over the limit here.
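To make the margins in these figures concrete, here is a minimal Python sketch. The 120-minute limit is an assumption taken from later comments in this bug (builds timing out "after 120mins"), the durations are the approximate ones quoted above, and the function name is purely illustrative.

```python
# Approximate durations quoted in the comment above, in minutes.
OBSERVED_MIN = {
    "job on c5.4xlarge": 100,
    "Win PGO build on c4.4xlarge": 125,
}

# Assumption: the limit being hit is 2h (120 min); the proposal is 3h (180 min).
CURRENT_TIMEOUT_MIN = 120
PROPOSED_TIMEOUT_MIN = 180


def headroom(duration_min, timeout_min):
    """Minutes left before the timeout; a negative value means the job times out."""
    return timeout_min - duration_min


if __name__ == "__main__":
    for label, duration in OBSERVED_MIN.items():
        now = headroom(duration, CURRENT_TIMEOUT_MIN)
        later = headroom(duration, PROPOSED_TIMEOUT_MIN)
        print(f"{label}: {now:+d} min at 2h, {later:+d} min at 3h")
```

A job that needs roughly 125 minutes misses a 2h limit by a few minutes but has close to an hour of slack under a 3h limit.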

Mike Hommey [:glandium]

Comment 4

•

6 years ago

There are multiple things going wrong here, and increasing the timeouts is only hiding the problem. One is that the msvc builds on beta are not using the msvc mozconfigs, so they are effectively clang-cl builds. Another is the regression in build times, and seeing the corresponding alert (https://treeherder.mozilla.org/perf.html#/alerts?id=16907) it looks like there might be something fishy going on with sccache, but it also looks like there is a legit build time regression from bug 1486554.

Mike Hommey [:glandium]

Comment 5

•

6 years ago

(In reply to Mike Hommey [:glandium] from comment #4)
> One is that the msvc builds on beta are not using the msvc mozconfigs, so
> they are effectively clang-cl builds.

Filed bug 1500290.

Flags: needinfo?(mh+mozilla)

Mike Hommey [:glandium]

Comment 6

•

6 years ago

(In reply to Mike Hommey [:glandium] from comment #4)
> it looks like there might be something fishy going on with sccache

Filed bug 1500295.

Mike Hommey [:glandium]

Comment 7

•

6 years ago

There is a consistent ~12-13% regression in build times from using the plugin on Windows. I measured what it looks like on linux64, and it's on the order of ~9-10%, so Windows is only slightly worse. That is not entirely surprising: on Windows, the plugin contains all the LLVM and clang code it uses, effectively duplicating what's in the clang binary. That's not the case on Linux.

So short of fixing sccache on Windows (bug 1476604), there's not much we can do in the short term besides increasing the timeouts.

We may want to look more closely, separately, at whether there are some avoidable performance pitfalls in the plugin implementation (fixing those would improve performance on both Linux and Windows). I'll gather some more data and file a separate bug for that.

Please close this bug if the timeout increases addressed the issue on all the builds that were affected.
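As a rough illustration of what a regression of this size means against a fixed timeout, here is a small Python sketch. It assumes a 120-minute limit (the figure mentioned elsewhere in this bug) and uses the upper ends of the percentages above; the function names are hypothetical and no baseline here is a measured value.

```python
def slowed_duration(baseline_min, regression):
    """Build time after a fractional regression, e.g. 0.13 for 13%."""
    return baseline_min * (1.0 + regression)


def max_baseline_within(timeout_min, regression):
    """Longest pre-regression build time that still fits inside the timeout."""
    return timeout_min / (1.0 + regression)


if __name__ == "__main__":
    timeout_min = 120  # assumed limit, in minutes
    for label, regression in (("Windows", 0.13), ("linux64", 0.10)):
        limit = max_baseline_within(timeout_min, regression)
        print(
            f"{label}: builds that took more than {limit:.1f} min before "
            f"the regression now exceed {timeout_min} min"
        )
```

Under these assumptions, builds that were already within a quarter of an hour of the limit are the ones pushed over it, which is consistent with only the slower instances timing out.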

Nick Thomas [:nthomas] (UTC+12)

Comment 8

•

6 years ago

The opt builds (B) on beta were also hitting timeouts. But separately, bstack removed c4.4xlarge from the allowed instance types for gecko-3-b-win2012 (https://tools.taskcluster.net/aws-provisioner/gecko-3-b-win2012/view). I'm not sure if the intention is to do that permanently, but it would solve the timeout issue.

Mike Hommey [:glandium]

Comment 9

•

6 years ago

The opt builds on beta are PGO, their timeout was probably borderline before.

Comment hidden (Intermittent Failures Robot)

Rob Thijssen [:grenade (EET/UTC+0300)]

Comment 11

•

6 years ago

regarding the removal of c4.4xlarge worker types from the allowed instance types mentioned in comment 8: we're seeing difficulty obtaining enough c5 instances to keep pending times normal. eg: https://tools.taskcluster.net/aws-provisioner/gecko-3-b-win2012/health shows that many spot requests are rejected by ec2 with the message: InsufficientInstanceCapacity

if we need builds to occur on c5 only, we'll need to increase the number of regions we support building in (set up sccache and hg buckets and acls) so that it's possible to obtain the number of instances we require.

currently we can build in:
- eu-central-1
- us-east-1
- us-east-2
- us-west-1
- us-west-2

maybe we should look at adding sccache and hg s3 buckets and enabling builds in:
- ap-south-1
- ap-northeast-1
- ap-northeast-2
- ap-southeast-1
- ap-southeast-2
- ca-central-1
- eu-west-1
- eu-west-2
- eu-west-3
- sa-east-1

Rob Thijssen [:grenade (EET/UTC+0300)]

Comment 12

•

6 years ago

worker type gecko-3-b-win2012 was configured to only provision instances in us-east-1, us-west-1 & us-west-2. within those three regions we only have 38 instances running and ~500 pending jobs. i have added regions us-east-2 & eu-central-1 to the gecko-3-b-win2012 configuration (this now matches the gecko-1-b-win2012 configuration) in an effort to get more c5.4xlarge instances up and the queue reduced. i know that in the past we have disabled us-east-2 & eu-central-1 because of hg issues (slow clone times), so if we see problems relating to that again or other region issues (maybe sccache), we'll need to remove those regions again and rethink what we're going to do. the configuration that i added regions to is here (in case we need to revert my changes): https://tools.taskcluster.net/aws-provisioner/gecko-3-b-win2012/edit

Rob Thijssen [:grenade (EET/UTC+0300)]

Comment 13

•

6 years ago

i have added c4 instance types back to the gecko-3-b-win2012 worker type definition in an effort to get some instances running. the extra regions did not give us any new instances: after an hour or so of waiting there are still only 38 instances, and the queue is growing despite the trees being closed.

also there are errors like this in papertrail:
"ERROR aws-provisioner-production: error provisioning this worker type, skipping (workerType=gecko-3-b-win2012, err={})"
https://papertrailapp.com/systems/taskcluster-aws-provisioner2/events?focus=989867078829891597

the most recent health message at https://tools.taskcluster.net/aws-provisioner/gecko-3-b-win2012/health is more than 16 hours old. i'm not sure why nothing has been provisioned in more than 15 hours and have been unable to get a response to !t-rex on #taskcluster.

Rob Thijssen [:grenade (EET/UTC+0300)]

Comment 14

•

6 years ago

i also removed the new regions (use2,euc1) for gecko-3-b-win2012.

Sebastian Hengst [:aryx] (needinfo me if it's about an intermittent or backout)

Reporter

Updated

•

6 years ago

Depends on: 1500361

Rob Thijssen [:grenade (EET/UTC+0300)]

Comment 15

•

6 years ago

there are 600 instances up now. all c4 except the 38 already running c5 instances.

No longer depends on: 1500361

Rob Thijssen [:grenade (EET/UTC+0300)]

Updated

•

6 years ago

Depends on: 1500361

Cosmin Sabou [:CosminS]

Comment 16

•

6 years ago

Win2012 opt B & Bmsvc started to time out again after this merge to beta: https://hg.mozilla.org/releases/mozilla-beta/pushloghtml?changeset=01378c910610cd214b2838650d0d2b7218fa8b5d
TH link: https://treeherder.mozilla.org/#/jobs?repo=mozilla-beta&resultStatus=success%2Cbusted&searchStr=win%2C2012%2Copt&tochange=01378c910610cd214b2838650d0d2b7218fa8b5d&fromchange=8efe26839243319464f00a472363e392de27cd4a&selectedJob=206720956

When this landed on beta and central the builds were green under the 120mins mark (e.g. beta http://tinyurl.com/y8t6jf3b - 100mins, central http://tinyurl.com/ycst9ejc - 56 mins), so I'm not sure that really addressed the issue, because two days later they still timed out after 120mins: http://tinyurl.com/yb37ezoe

Comment hidden (Intermittent Failures Robot)

Ryan VanderMeulen [:RyanVM]

Comment 18

•

6 years ago

Build times on Beta and Central aren't directly comparable because Beta has PGO enabled by default and m-c doesn't (you'd need to compare Beta vs. m-c PGO to get a more fair comparison). Anyway, yes, the underlying issue is still there - PGO builds on c4 instances are prone to timing out and ones on c5 instances aren't, which is basically luck of the draw. That said, I'll bump the build timeout on Beta64 to at least cut down on the noise for now until we can come up with something more robust. I'm not going to touch the MSVC builds, though, since those should be fixed once bug 1500290 is sorted out (shudder).

Ryan VanderMeulen [:RyanVM]

Comment 19

•

6 years ago

Bump the timeout on Beta for Windows opt builds to match other PGO-enabled builds: https://hg.mozilla.org/releases/mozilla-beta/rev/aba25ceab10f

(not currently active) Ted Mielczarek

Comment 20

•

6 years ago

I wonder if it's worthwhile to configure a separate worker type that will only provide c5.4xlarge instances, and change our task definitions so that PGO builds use that worker type, where other builds can use the standard gecko-3-b-win2012? If we have trouble maintaining capacity without m4.4xlarge instances in the mix it doesn't seem sensible to force all our builds onto the larger instances.

Rob Thijssen [:grenade (EET/UTC+0300)]

Comment 21

•

6 years ago

Attached file: GitHub Pull Request

this pr adds worker types gecko-3-b-win2012-c4 & gecko-3-b-win2012-c5 to the occ ci ami builds. note that this will only create the extra worker types. once we verify that they work as expected, we can look at changing builds to make use of them (or not, depending on what's decided about how to move forward)

Attachment #9019137 - Flags: review?(mcornmesser)

Mark Cornmesser [:markco]

Updated

•

6 years ago

Attachment #9019137 - Flags: review?(mcornmesser) → review+

Comment hidden (Intermittent Failures Robot)

Sebastian Hengst [:aryx] (needinfo me if it's about an intermittent or backout)

Reporter

Comment 23

•

6 years ago

First beta merge is next Monday and the Windows opt build timeouts will have to be increased again from 2h to 3h. Does anything else need to be done?

Flags: needinfo?(rthijssen)

Rob Thijssen [:grenade (EET/UTC+0300)]

Comment 24

•

6 years ago

my suggestion would be to explicitly use worker type: gecko-3-b-win2012-c5 (instead of gecko-3-b-win2012)

> - gecko-3-b-win2012-c4 always uses the slower c4.4xlarge ec2 instance types.
> - gecko-3-b-win2012-c5 always uses the faster c5.4xlarge ec2 instance types.
> - gecko-3-b-win2012 is a lottery and may use either c4.4xlarge or c5.4xlarge ec2 instance types depending on what the provisioner asks for and what ec2 has available at the time of the spot request.

in-tree changes will be needed to specify that gecko-3-b-win2012-c5 should be used for those builds, but what changes are needed is beyond my expertise.

Flags: needinfo?(rthijssen)

Nick Thomas [:nthomas] (UTC+12)

Comment 25

•

6 years ago

Do we know if there is more c5.4xlarge spot-capacity now than at the time of comment #13 ? The reduced build times with that family are great but it'd be a worry if the builds are pinned to a scarce resource.

Ryan VanderMeulen [:RyanVM]

Comment 26

•

6 years ago

Can we force only release branches onto the c5 workers and leave trunk for whatever's available (c5 preferred)?
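One way to picture this suggestion, as a hedged Python sketch rather than the actual in-tree change: release branches get the c5-only worker type and everything else keeps the mixed pool. Only the worker type names come from comment 24; the set of release repositories and the function name are assumptions made for illustration, and the "c5 preferred" part is not modelled because that choice is made by the provisioner.

```python
# Worker type names as described in comment 24.
MIXED_WORKER_TYPE = "gecko-3-b-win2012"   # may get c4.4xlarge or c5.4xlarge
C5_WORKER_TYPE = "gecko-3-b-win2012-c5"   # always c5.4xlarge

# Assumption: which repositories count as release branches.
RELEASE_PROJECTS = frozenset({"mozilla-beta", "mozilla-release"})


def pick_worker_type(project):
    """Return the worker type a Windows build on `project` would be pinned to."""
    if project in RELEASE_PROJECTS:
        return C5_WORKER_TYPE
    return MIXED_WORKER_TYPE


if __name__ == "__main__":
    for project in ("mozilla-central", "mozilla-beta", "try"):
        print(f"{project}: {pick_worker_type(project)}")
```

Keeping trunk on the mixed pool limits how many builds depend on c5 spot capacity, which is the concern raised in comment 25.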

Comment hidden (Intermittent Failures Robot)

Sebastian Hengst [:aryx] (needinfo me if it's about an intermittent or backout)

Reporter

Comment 28

•

6 years ago

bugherder uplift

https://hg.mozilla.org/releases/mozilla-beta/rev/c9f6bd2a826d

status-firefox65: --- → fixed

Andreea Pavel [:apavel]

Comment 30

•

6 years ago

No longer seeing this occur on latest beta-sim:

https://treeherder.mozilla.org/#/jobs?repo=try&resultStatus=testfailed%2Cbusted%2Cexception%2Cretry%2Cusercancel%2Crunnable&revision=9bd35cae6174983ac330fe579bc0d82976aa5c6b

Last occurrence was on the 4th of January: https://treeherder.mozilla.org/#/jobs?repo=try&resultStatus=testfailed%2Cbusted%2Cexception%2Cretry%2Cusercancel%2Crunnable&revision=56edc08ffce4b019a75ba90f55b87c55f9d2bead&selectedJob=219953604

Sebastian Hengst [:aryx] (needinfo me if it's about an intermittent or backout)

Reporter

Comment 31

•

6 years ago

uplift

Build times are still close to the timeout, have applied the timeout increase once more: https://hg.mozilla.org/releases/mozilla-beta/rev/31f926caf2452050575ab84e54e6cdfbf6b2cbfc

Sebastian Hengst [:aryx] (needinfo me if it's about an intermittent or backout)

Reporter

Updated

•

6 years ago

status-firefox66: --- → fixed

Comment hidden (Intermittent Failures Robot)

Sebastian Hengst [:aryx] (needinfo me if it's about an intermittent or backout)

Reporter

Comment 33

•

6 years ago

Closing this for now; it hasn't failed since Gecko 67 landed on beta 3 days ago. https://treeherder.mozilla.org/logviewer.html#?job_id=233898544&repo=mozilla-beta ran for 5 minutes less than the limit allowed.

Status: NEW → RESOLVED

Closed: 6 years ago

Resolution: --- → WORKSFORME

BugBot [:suhaib / :marco/ :calixte]

Updated

•

6 years ago

Keywords: leave-open

BugBot [:suhaib / :marco/ :calixte]

Updated

•

6 years ago

Keywords: regression