Closed Bug 1555894 Opened 5 years ago Closed 5 years ago

32.33% build times (windows-mingw32) regression on push range 578cc4c154efb9d1d54da26e2f2818df88b3971e..67eb2cac32c2db9e105ef3b38480c8de56f0ddfe

Categories

(Core :: JavaScript: WebAssembly, defect, P2)

Unspecified
Windows
defect

Tracking

()

RESOLVED FIXED
mozilla70
Tracking Status
firefox-esr60 --- unaffected
firefox-esr68 --- disabled
firefox68 --- disabled
firefox69 --- disabled
firefox70 --- fixed

People

(Reporter: igoldan, Assigned: glandium)

References

(Regression)

Details

(Keywords: in-triage, regression)

Attachments

(2 files)

We have detected a build metrics regression from this push range:

https://hg.mozilla.org/integration/autoland/pushloghtml?fromchange=cb21c0b524ff08b12769387a1f8d3fa42258cb22&tochange=67eb2cac32c2db9e105ef3b38480c8de56f0ddfe

You are the author of one of the patches included in that push, so we need your help to address this regression.

Regressions:

32% build times windows-mingw32 all 32 clang opt taskcluster-c5.4xlarge 1,536.21 -> 2,032.93

You can find links to graphs and comparison views for each of the above tests at: https://treeherder.mozilla.org/perf.html#/alerts?id=21162

On the page above you can see an alert for each affected platform as well as a link to a graph showing the history of scores for this test. There is also a link to a treeherder page showing the jobs in a pushlog format.

To learn more about the regressing test(s), please see: https://developer.mozilla.org/en-US/docs/Mozilla/Performance/Automated_Performance_Testing_and_Sheriffing/Build_Metrics

*** Please let us know your plans within 3 business days, or the offending patch(es) will be backed out! ***

Product: Testing → Firefox Build System
Flags: needinfo?(bugspam.Callek)

:Callek, out of all the bugs from the push range, bug 1547730 seems most related to this issue.

At first I suspected my patch set as well, but then I noticed that the range listed for this bug was from the first landing, which included the backout (https://hg.mozilla.org/integration/autoland/rev/ee4b88439111cf03944808dc170dbefa74fbdab0), so I'm calling this one "not-me".

Flags: needinfo?(bugspam.Callek)

Given that it only affects a Tier 2 build, I think we can just live with this regression.

Status: NEW → RESOLVED
Closed: 5 years ago
Resolution: --- → WONTFIX
Status: RESOLVED → REOPENED
Component: General → Javascript: WebAssembly
Flags: needinfo?(bbouvier)
Product: Firefox Build System → Core
Regressed by: 1547682
No longer regressed by: 1547730
Resolution: WONTFIX → ---

Also, https://treeherder.mozilla.org/perf.html#/alerts?id=20985 is actually about this regression.

Hmm, we've had a few build time regressions for Cranelift, and my understanding is the following:

  • Rust build times are long; it's a fact of life.
  • It's hard to investigate whether we can reduce them, since there's no good tooling to do so.
  • We rely on the Rust compiler getting better at some point.
  • We'd like to ship Cranelift at some point in the future.

So considering all this, we've taken these regressions in the past. Unless there's something obvious and actionable (and I'd be happier knowing there's tooling to help identify pain points in the Rust build!), I think the most sensible course of action is to take this build time regression too. (At the moment, it's Nightly-only / hidden behind a configure flag, so it won't affect other builds.)

Is this reasonable?

Flags: needinfo?(bbouvier)

(In reply to Benjamin Bouvier [:bbouvier] from comment #6)
> Hmm, we've had a few build time regressions for Cranelift, and my understanding is the following:
>   • Rust build times are long; it's a fact of life.
>   • It's hard to investigate whether we can reduce them, since there's no good tooling to do so.
>   • We rely on the Rust compiler getting better at some point.
>   • We'd like to ship Cranelift at some point in the future.
> So considering all this, we've taken these regressions in the past. Unless there's something obvious and actionable (and I'd be happier knowing there's tooling to help identify pain points in the Rust build!), I think the most sensible course of action is to take this build time regression too. (At the moment, it's Nightly-only / hidden behind a configure flag, so it won't affect other builds.)
> Is this reasonable?

Realistically, no. We have enough push-back on adding more Rust features due to build perf issues. This hit is quite large for a feature that isn't ready for daily use. It affects every developer build and the majority of our automation builds. Ideally we would disable Cranelift by default until we want to ship it in a real build, and hide it behind an --enable-cranelift flag. Once the functionality and performance are at a point where we want to enable it by default on Nightly, the build team and rustc teams should be consulted to see if there are obvious wins on the build time impact.
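
(For context, an opt-in flag of the kind described above would be declared in the moz.configure Python DSL. The following is only an illustrative sketch of that shape, not the flag as it exists in-tree; the option, config, and define names are assumptions.)

    # Illustrative sketch only (this runs inside the moz.configure sandbox,
    # not as a standalone script): declare an opt-in flag so Cranelift is
    # only built when explicitly requested. Names are assumed for
    # illustration, not taken from the actual tree.
    option('--enable-cranelift',
           help='Enable building the Cranelift compiler backend')

    @depends('--enable-cranelift')
    def cranelift(enabled):
        # Returning nothing (None) when the flag is absent means
        # "feature disabled" in moz.configure.
        if enabled:
            return True

    set_config('ENABLE_WASM_CRANELIFT', cranelift)
    set_define('ENABLE_WASM_CRANELIFT', cranelift)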

(In reply to Eric Rahm [:erahm] (Out until Aug 6th) from comment #7)
> (In reply to Benjamin Bouvier [:bbouvier] from comment #6)
> > Hmm, we've had a few build time regressions for Cranelift, and my understanding is the following:
> >   • Rust build times are long; it's a fact of life.
> >   • It's hard to investigate whether we can reduce them, since there's no good tooling to do so.
> >   • We rely on the Rust compiler getting better at some point.
> >   • We'd like to ship Cranelift at some point in the future.
> > So considering all this, we've taken these regressions in the past. Unless there's something obvious and actionable (and I'd be happier knowing there's tooling to help identify pain points in the Rust build!), I think the most sensible course of action is to take this build time regression too. (At the moment, it's Nightly-only / hidden behind a configure flag, so it won't affect other builds.)
> > Is this reasonable?
>
> Realistically, no. We have enough push-back on adding more Rust features due to build perf issues. This hit is quite large for a feature that isn't ready for daily use. It affects every developer build and the majority of our automation builds. Ideally we would disable Cranelift by default until we want to ship it in a real build, and hide it behind an --enable-cranelift flag. Once the functionality and performance are at a point where we want to enable it by default on Nightly, the build team and rustc teams should be consulted to see if there are obvious wins on the build time impact.

Re-reading this, it sounds like this is Nightly builds only, not all m-c, autoland, and m-i builds. If that's the case I'm less concerned, but it would be helpful to have confirmation. If it's all automation builds, I'm still concerned.

You were right in your first comment: this is affecting all m-i, m-c, and autoland builds. It is not affecting mozilla-beta / mozilla-release etc., since Cranelift is Nightly-only.

It is already possible to use --disable-cranelift to avoid compiling it. Note that there are other Rust crates that take time to build as part of SpiderMonkey for other experimental features, like BinAST.

My primary motivation for keeping it in the build is that it has been a major pain (Nathan / Glandium can testify) to get Cranelift building as part of SpiderMonkey on all platforms, so I really, really don't want to regress here. On local machines the compilation time can be an issue too, although it has helped us spot other build issues not caught in automation.

If the automation watch system can be very precise and detect that anything related to the build system, or files in js/src/wasm/cranelift/**/* or js/src/wasm/WasmCraneliftCompile.{h,cpp}, has changed, then we could have a specific taskcluster task that triggers the Cranelift build only in that case, and disable it by default otherwise. I don't know if this is feasible.
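
(To make the file-matching idea concrete: a scheduling check could compare the files touched by a push against Cranelift-related path patterns before deciding whether to run a Cranelift-enabled build task. The sketch below is only an illustration, not existing automation code; the patterns and the hg invocation are assumptions about how such a check might be wired up.)

    # Sketch: decide whether a push touches Cranelift-relevant files.
    # The path patterns come from the suggestion above; how this would hook
    # into task scheduling is left open.
    import fnmatch
    import subprocess

    CRANELIFT_PATTERNS = [
        "js/src/wasm/cranelift/*",
        "js/src/wasm/WasmCraneliftCompile.h",
        "js/src/wasm/WasmCraneliftCompile.cpp",
        "build/*",    # loosely, "anything related to the build system"
        "config/*",
    ]

    def changed_files(rev="."):
        """Return the files modified by the given revision, according to hg."""
        out = subprocess.check_output(
            ["hg", "log", "-r", rev, "--template", "{join(files, '\\n')}"],
            text=True,
        )
        return [line for line in out.splitlines() if line]

    def touches_cranelift(files):
        """True if any changed file matches a Cranelift-relevant pattern."""
        return any(
            fnmatch.fnmatch(f, pattern)
            for f in files
            for pattern in CRANELIFT_PATTERNS
        )

    if __name__ == "__main__":
        if touches_cranelift(changed_files()):
            print("trigger the Cranelift-enabled build task")
        else:
            print("skip the Cranelift-enabled build task")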

Another possibility I am thinking of: I think we can trigger periodically scheduled build tasks, so we could disable Cranelift by default and build it in automation once per week, or something like this. If that build breaks, we'd use bisection to find which commits regressed it, and the authors of the regressing commits would be expected to fix the Cranelift build.

Any other option is not very attractive to me: disabling Cranelift on non-x64, or disabling it entirely without ever testing it in automation, could break the build, and fixing it could require another several months of investigation / fixes. In general, I just don't want the responsibility of fixing the Cranelift build to come back to the WebVM team: it has been a terrible, close-to-burnout experience to deal with build system issues on all the platforms. I understand the build peers are already short on manpower and couldn't do one-shot fixes for it, hence my insistence on keeping the build enabled everywhere, so that build fixes land in smaller chunks.

To wit: in the short to mid-term, Cranelift might get enabled by default on some platforms to compile WebAssembly, which is part of Mozilla's strategy to lead in the future and explore new technologies. In the long run, Cranelift is likely to replace some even bigger parts of SpiderMonkey, including but not limited to code generation for IonMonkey (our JIT compiler for JS code).

Some data:
These are the build times for Cranelift vs. --disable-cranelift on automation builds, with sccache disabled:
https://treeherder.mozilla.org/perf.html#/compare?originalProject=try&originalRevision=179c08fc7fd78b645d710619fb87596dfe92965a&newProject=try&newRevision=811b64ec4393287699f5970f3e126683e51bbdac&framework=2
Apart from the weird times on Windows on c4.4xlarge and c5.4xlarge, it shows a real hit. With sccache enabled, I suspect the difference would be larger because of the build script executions (those aren't cached).

These are the build times for Cranelift vs. --disable-cranelift on developer builds. The main differences from the above are that libxul-gtest is not built, Rust LTO is disabled, and fewer tiers are executed:
https://treeherder.mozilla.org/perf.html#/compare?originalProject=try&originalRevision=8533ad07ca5bd64c01bac7d59c7c8ebefeffb11d&newProject=try&newRevision=603a9357288bd3e93fe73903c97d00b793818b29&framework=2

Mike, Eric - any reaction to these suggestions from Benjamin as to how to break the impasse?

> If the automation watch system can be very precise and detect that anything related to the build system, or files in js/src/wasm/cranelift/**/* or js/src/wasm/WasmCraneliftCompile.{h,cpp}, has changed, then we could have a specific taskcluster task that triggers the Cranelift build only in that case, and disable it by default otherwise. I don't know if this is feasible.

> Another possibility I am thinking of: I think we can trigger periodically scheduled build tasks, so we could disable Cranelift by default and build it in automation once per week, or something like this. If that build breaks, we'd use bisection to find which commits regressed it, and the authors of the regressing commits would be expected to fix the Cranelift build.

Flags: needinfo?(mh+mozilla)
Flags: needinfo?(erahm)
Priority: -- → P2

(In reply to Lars T Hansen [:lth] from comment #11)

> Mike, Eric - any reaction to these suggestions from Benjamin as to how to break the impasse?

I discussed this with Mike. We think the best option is to disable Cranelift by default but keep building standalone SpiderMonkey with Cranelift enabled, so that we can make sure everything builds okay across platforms. When we get to the point where we want to test out Cranelift for WASM compilation, we can enable it via the in-tree mozconfigs for the platforms we want to target.
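
(Extending the earlier sketch: the default for such a flag could be computed from whether the build is a standalone SpiderMonkey build, which is the direction the patch below eventually took. The `js_standalone` value, option name, and help-text form here are assumptions for illustration, not the actual in-tree code.)

    # Sketch only: default Cranelift on for standalone SpiderMonkey builds and
    # off for full Firefox builds, while keeping an explicit opt-in for the
    # in-tree mozconfigs mentioned above. `js_standalone` is assumed to be a
    # moz.configure value that is true for standalone JS builds.
    @depends(js_standalone)
    def cranelift_default(js_standalone):
        return bool(js_standalone)

    option('--enable-cranelift', default=cranelift_default,
           # The {Enable|Disable} form is assumed here as the usual way to
           # document an option whose default is computed.
           help='{Enable|Disable} building the Cranelift compiler backend')

With a default like this, Firefox builds would skip the Cranelift crates unless a mozconfig opts back in, e.g. with ac_add_options --enable-cranelift for the platforms being targeted.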

Flags: needinfo?(erahm)
Flags: needinfo?(mh+mozilla)
Assignee: nobody → mh+mozilla

OK, we'll create a manually triggered weekly test run to safeguard compilation with Firefox, details TBD.

Out of curiosity, did the landing of https://hg.mozilla.org/mozilla-central/rev/d4b4afb9cd96 change anything about build times? It removed the last Python bits used during the cargo build step (which we assumed caused the slowness here).

Flags: needinfo?(mh+mozilla)
Flags: needinfo?(igoldan)
Attached image image.png

(In reply to Benjamin Bouvier [:bbouvier] from comment #15)

> Out of curiosity, did the landing of https://hg.mozilla.org/mozilla-central/rev/d4b4afb9cd96 change anything about build times? It removed the last Python bits used during the cargo build step (which we assumed caused the slowness here).

Not that I can tell, and certainly not by the amount of the regression. You can take a look at the graphs in Perfherder (see comment 0).

Pushed by mh@glandium.org:
https://hg.mozilla.org/integration/autoland/rev/6eccfe79d02f
Only enable cranelift on JS standalone builds. r=lth,nalexander
Flags: needinfo?(mh+mozilla)
Status: REOPENED → RESOLVED
Closed: 5 years ago5 years ago
Resolution: --- → FIXED
Target Milestone: --- → mozilla70

== Change summary for alert #22538 (as of Mon, 19 Aug 2019 20:16:46 GMT) ==

Improvements:

12% build times windows2012-64-shippable opt instrumented taskcluster-c5.4xlarge 1,797.00 -> 1,584.59
10% build times windows2012-64-shippable opt instrumented taskcluster-c4.4xlarge 2,187.01 -> 1,967.05
9% build times windows2012-32-shippable opt instrumented taskcluster-c4.4xlarge 2,183.88 -> 1,996.73
4% build times windows2012-64-shippable opt nightly taskcluster-c4.4xlarge 4,416.64 -> 4,227.99

For up to date results, see: https://treeherder.mozilla.org/perf.html#/alerts?id=22538

Flags: needinfo?(igoldan)
Has Regression Range: --- → yes
