Closed Bug 1506511 Opened 7 years ago Closed 6 years ago

3.96 - 22.09% build times (windows2012-32, windows2012-64, windows2012-64-noopt) regression on push d2013de59e87ff63899ef1e634bd3889ec126c93 (Fri Nov 9 2018)

Categories

(Core :: JavaScript: WebAssembly, defect, P3)

defect

Tracking


VERIFIED FIXED
mozilla70
Tracking Status
firefox-esr60 --- unaffected
firefox-esr68 --- wontfix
firefox63 --- unaffected
firefox64 --- unaffected
firefox65 + disabled
firefox66 --- wontfix
firefox69 --- wontfix
firefox70 --- wontfix

People

(Reporter: igoldan, Assigned: sunfish)

References

Details

(Keywords: regression)

We have detected a build metrics regression from push:
https://hg.mozilla.org/integration/mozilla-inbound/pushloghtml?changeset=d2013de59e87ff63899ef1e634bd3889ec126c93

As author of one of the patches included in that push, we need your help to address this regression.

Regressions:

 22%  build times windows2012-64 opt plain taskcluster-c4.4xlarge    2,199.75 -> 2,685.61
 18%  build times windows2012-64 opt plain taskcluster-c5.4xlarge    1,966.33 -> 2,317.40
 13%  build times windows2012-64 asan opt taskcluster-c4.4xlarge     2,831.88 -> 3,200.00
 13%  build times windows2012-64 opt taskcluster-c4.4xlarge          2,748.70 -> 3,099.56
 13%  build times windows2012-32 opt taskcluster-c4.4xlarge          2,705.11 -> 3,048.81
  5%  build times windows2012-64 debug plain taskcluster-c4.4xlarge  2,067.47 -> 2,162.76
  4%  build times windows2012-64-noopt debug taskcluster-c4.4xlarge  2,034.48 -> 2,115.01

You can find links to graphs and comparison views for each of the above tests at:
https://treeherder.mozilla.org/perf.html#/alerts?id=17471

On the page above you can see an alert for each affected platform as well as a link to a graph showing the history of scores for this test. There is also a link to a treeherder page showing the jobs in a pushlog format.

To learn more about the regressing test(s), please see:
https://developer.mozilla.org/en-US/docs/Mozilla/Performance/Automated_Performance_Testing_and_Sheriffing/Build_Metrics
Component: General → Javascript: Web Assembly
Product: Testing → Core
Flags: needinfo?(sunfish)
It's not immediately obvious to me what this push might have done that would cause this kind of regression. We added more code, but not an inordinate amount. I'll work on getting a machine to reproduce this on, but in the meantime, if anyone has any ideas about what might have happened here, I'd much appreciate it!
OS: Windows → All
It looks like Linux got affected as well.
The patch included a change to one of the build scripts which makes it print out how much time it took, and the path of its output directory. These are non-deterministic, and it's my current theory that this is preventing the build script's output from being cached. I created https://github.com/CraneStation/cranelift/pull/604 to remove this output. Once that lands upstream I'll submit a patch here for mozilla-inbound.
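(For illustration only: a minimal Python sketch of the kind of non-deterministic build-script output described above. The function and file names here are hypothetical, not the actual Cranelift script; the point is that an elapsed-time figure and an absolute output path differ on every run, so anything that hashes or caches the script's output sees a new result each time.)

    import os
    import time

    def emit_sources(out_dir):
        # Placeholder for the real source-generation step.
        with open(os.path.join(out_dir, "generated.rs"), "w") as f:
            f.write("// generated\n")

    def generate(out_dir):
        start = time.time()
        emit_sources(out_dir)
        # Non-deterministic output: the elapsed time and the absolute
        # output path change between runs and between build machines.
        print("Generated sources in %.2fs" % (time.time() - start))
        print("Output directory: %s" % os.path.abspath(out_dir))

The upstream pull request removes this output entirely.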
Depends on: 1507819
Is this addressed by the Cranelift bump in bug 1507819, which has landed?
Flags: needinfo?(igoldan)
(In reply to Benjamin Bouvier [:bbouvier] from comment #4)
> Is this addressed by the Cranelift bump in bug 1507819, which has landed?

No, I didn't see any kind of improvement.
Flags: needinfo?(igoldan)
Assignee: nobody → sunfish
I don't have any other ideas currently. If we need a fix promptly, we should disable Cranelift until we can figure out what's going on.
(In reply to Dan Gohman [:sunfish] from comment #6)
> I don't have any other ideas currently. If we need a fix promptly, we should
> disable Cranelift until we can figure out what's going on.

:froydnj, are you the person to decide what to do next regarding this issue?
Flags: needinfo?(nfroyd)
A few notes:

- Cranelift is enabled only on Nightlies, so this build time regression will not propagate to other channels.

- We could (and maybe should) disable Cranelift compilation on platforms that aren't x64, because it can't generate code for these platforms anyway. At the same time, the integration into the build system has been complicated enough that I wouldn't want anybody to have to do it again, so it is nice to keep compiling on all the platforms, just to make sure we are not regressing the build. Cranelift doesn't have features to disable code generation on specific platforms yet, so maybe we should bump https://github.com/CraneStation/cranelift/issues/13. That being said, we would still have the build time regression on all the 64-bit architectures. That still sounds like the best course of action to me: implement issue #13, then disable the code generation features on other platforms, while keeping the Cranelift build enabled by default in Gecko.

- I don't know how to profile build times in general. Do we have tooling for this?
I can decide what we can do. I talked to bbouvier this morning, and one thing to look at is whether the recent cranelift update changed the amount of code we can sccache (sccache cacheable requests). Can you (Ionut) look at this? Otherwise, there have been some changes to how we compile Rust code recently that may have helped this issue. We should wait and see for now; I don't think we should back this code out or anything like that.
Flags: needinfo?(nfroyd)
(In reply to Nathan Froyd [:froydnj] from comment #9)
> I talked to bbouvier this morning, and one thing to look at is whether the
> recent cranelift update changed the amount of code we can sccache (sccache
> cacheable requests). Can you (Ionut) look at this?

sccache requests_not_cacheable showed almost no change during our build time regressions. I checked this on Linux and Windows 64-bit.

(In reply to Nathan Froyd [:froydnj] from comment #9)

Otherwise, there have been some changes to how we compile Rust code recently
that may have helped this issue. We should wait and see for now; I don't
think we should back this code out or anything like that.

Can we resume the investigation into this, given comment 10?

Flags: needinfo?(nfroyd)

We could try. Now that I think about it, it's quite peculiar that the regression only shows up on Windows, and not any of our other platforms.

Cranelift invokes Python somewhere during its build process (or maybe it used to?)...is it possible that we're invoking Python a lot, and we don't notice it on Unix because new process creation is so fast, but we do notice it on Windows?

Flags: needinfo?(nfroyd) → needinfo?(bbouvier)

If so, then one possible approach would be to check the generated files into the tree, rather than generating them at build time. That would avoid the need to run any python for typical builds. That'd require developers working on those files to manually run the scripts when making changes, but we might be ok with that if it fixes our build-time regression.
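(A rough sketch of how such a check-in workflow could look, assuming a hypothetical generator entry point and file path; this is not something Cranelift currently ships. Developers would run the script manually after editing the meta sources, and a --check mode would let automation verify that the checked-in files are still current without regenerating anything at build time.)

    import argparse
    import sys

    def generate_sources():
        # Placeholder for the real generation step; returns a mapping of
        # checked-in file path -> expected contents.
        return {"lib/codegen/src/generated_example.rs": "// generated\n"}

    def main():
        parser = argparse.ArgumentParser()
        parser.add_argument("--check", action="store_true",
                            help="verify checked-in files instead of rewriting them")
        args = parser.parse_args()
        for path, contents in generate_sources().items():
            if args.check:
                with open(path) as f:
                    if f.read() != contents:
                        sys.exit("%s is out of date; rerun the generator" % path)
            else:
                with open(path, "w") as f:
                    f.write(contents)

    if __name__ == "__main__":
        main()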

OK, so it looks like we only invoke Python once:

https://github.com/CraneStation/cranelift/blob/master/lib/codegen/build.rs#L65

so the multiple invocations of Python theory seems to not hold water.

I would say that the update_file function doesn't check the existing contents before writing, so it updates mtimes even when the generated output hasn't changed:

https://github.com/CraneStation/cranelift/blob/master/lib/codegen/meta-python/srcgen.py#L86-L91

But I would think that cargo ought to not run the build script in the first place, so that...shouldn't matter?
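(For reference, a generic content-comparing write along these lines would leave mtimes alone whenever the generated output is unchanged; this is a sketch of the idea, not the actual srcgen.py code.)

    import os

    def update_file(path, contents):
        # Only rewrite the file if its contents actually changed, so the
        # mtime stays put and nothing downstream sees a spurious update.
        if os.path.exists(path):
            with open(path) as f:
                if f.read() == contents:
                    return
        with open(path, "w") as f:
            f.write(contents)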

It seems strange to me that we're telling cargo to rerun-if-changed even for directory changes:

https://github.com/CraneStation/cranelift/blob/master/lib/codegen/meta-python/gen_build_deps.py#L25-L43

Maybe the sourcedirs are getting updated (mtimes?) on Windows (but not Unix-y systems), but not the actual files, and therefore Cargo thinks that everything needs to be recompiled, and that's what's triggering constant rebuilds of cranelift?
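(If directory mtimes are indeed what trips up Cargo on Windows, one option is to emit rerun-if-changed lines for individual files instead of whole directories. A sketch of what such a helper could print, not the current gen_build_deps.py code:)

    import os

    def emit_rerun_if_changed(source_dirs):
        # Point Cargo at individual files rather than directories, so a
        # bare directory-mtime bump doesn't invalidate the build script's
        # cached output.
        for source_dir in source_dirs:
            for root, _dirs, files in os.walk(source_dir):
                for name in files:
                    if name.endswith(".py"):
                        print("cargo:rerun-if-changed=%s" % os.path.join(root, name))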

Any updates here?

Flags: needinfo?(nfroyd)

Cranelift invokes Python somewhere during its build process (or maybe it used to?)...is it possible that we're invoking Python a lot, and we don't notice it on Unix because new process creation is so fast, but we do notice it on Windows?

Python is called only once, as you found out, so this can't be it. Maybe the sources being regenerated over time would prevent cache hits?

Is there an explanation of how Talos measures build time? Do we have any tool to measure the build time of a subcomponent or subdirectory, or anything really that isn't measuring the time it takes to build the whole tree?

(We could do an experiment in Cranelift, where the sources aren't dynamically generated (the build system would pass this option when building Cranelift), and instead we'd copy them in tree, as suggested in comment 14. Then we can see if the build time regression disappears or not, and investigate more precisely, or eliminate this as a possible solution.)

(In reply to Ionuț Goldan [:igoldan], Performance Sheriffing from comment #16)

Any updates here?

To wit: this is a Nightly-only regression; Cranelift is not getting built at all on other releases.

Flags: needinfo?(bbouvier)

https://github.com/CraneStation/cranelift/pull/656 implemented the directory changes idea in comment 15. We'll see if that helps on the next cranelift import, I guess?

(In reply to Benjamin Bouvier [:bbouvier] from comment #17)

Python is called only once, as you found out, so this can't be it. Maybe the sources being regenerated over time would prevent cache hits?

Maybe? If the sources being regenerated caused mtime updates, I would expect build time regressions on all platforms, not just Windows.

Is there an explanation of how Talos measures build time? Do we have any tool to measure the build time of a subcomponent or subdirectory, or anything really that isn't measuring the time it takes to build the whole tree?

We do not, unfortunately.

(We could do an experiment in Cranelift, where the sources aren't dynamically generated (the build system would pass this option when building Cranelift), and instead we'd copy them in tree, as suggested in comment 14. Then we can see if the build time regression disappears or not, and investigate more precisely, or eliminate this as a possible solution.)

I think I'd attempt this via try first, to avoid polluting cranelift's history. The regression is big enough that doing this plus several try builds ought to be enough to tell you whether things got fixed or not.

Flags: needinfo?(nfroyd)

What's the try command to get build time measurements, please? I'd like to try a few things.

(In reply to Benjamin Bouvier [:bbouvier] from comment #19)

What's the try command to get build time measurements, please? I'd like to try a few things.

I believe it is:

    mach try fuzzy --no-artifact --query "build-win64-plain/opt | build-win64-asan/opt | build-win64/opt | build-win64/debug | build-win64-noopt/debug"

Did you get a chance to try to reproduce, Benjamin?

Flags: needinfo?(bbouvier)

No, I haven't yet. I think the first step would be to do a bisection to confirm which change in Gecko introduced this regression, then in Cranelift to see if there's an obvious commit that raised compilation time (it might be a death-by-a-thousand-commits scenario too).

Flags: needinfo?(bbouvier)

(In reply to Benjamin Bouvier [:bbouvier] from comment #22)

No, I haven't yet. I think the first step would be to do a bisection to confirm which change in Gecko introduced this regression, then in Cranelift to see if there's an obvious commit that raised compilation time (it might be a death-by-a-thousand-commits scenario too).

Ionuț, is this something you can do?

Flags: needinfo?(igoldan)

(I'm going to mark this fix-optional for 66 since a build time regression doesn't affect users)

(In reply to Andrew Overholt [:overholt] from comment #23)

(In reply to Benjamin Bouvier [:bbouvier] from comment #22)

No, I haven't yet. I think the first step would be to do a bisection to confirm which change in Gecko introduced this regression, then in Cranelift to see if there's an obvious commit that raised compilation time (it might be a death-by-a-thousand-commits scenario too).

Ionuț, is this something you can do?

I don't think bisecting on Cranelift is something I could work on.
As for looking over Gecko changes: I need more details to understand whether I could help here.

Flags: needinfo?(igoldan)
Keywords: stalled
Priority: -- → P3

(Note this should now be addressed by the landing of bug 1555894, if this was Cranelift that caused the long build times.)

Flags: needinfo?(sunfish) → needinfo?(igoldan)

(In reply to Benjamin Bouvier [:bbouvier] from comment #26)

(Note this should now be addressed by the landing of bug 1555894, if this was Cranelift that caused the long build times.)

Indeed, all regressions here seem fixed now.

Flags: needinfo?(igoldan)
Status: NEW → RESOLVED
Closed: 6 years ago
Resolution: --- → FIXED
Status: RESOLVED → VERIFIED

Since the bug is closed, the stalled keyword is now meaningless.
For more information, please visit auto_nag documentation.

Keywords: stalled
Target Milestone: --- → mozilla70