Closed Bug 1506511 Opened 7 years ago Closed 6 years ago

3.96 - 22.09% build times (windows2012-32, windows2012-64, windows2012-64-noopt) regression on push d2013de59e87ff63899ef1e634bd3889ec126c93 (Fri Nov 9 2018)

Categories

(Core :: JavaScript: WebAssembly, defect, P3)

defect

Tracking


VERIFIED FIXED
mozilla70
Tracking Status
firefox-esr60 --- unaffected
firefox-esr68 --- wontfix
firefox63 --- unaffected
firefox64 --- unaffected
firefox65 + disabled
firefox66 --- wontfix
firefox69 --- wontfix
firefox70 --- wontfix

People

(Reporter: igoldan, Assigned: sunfish)

References

Details

(Keywords: regression)

We have detected a build metrics regression from push:
https://hg.mozilla.org/integration/mozilla-inbound/pushloghtml?changeset=d2013de59e87ff63899ef1e634bd3889ec126c93

As author of one of the patches included in that push, we need your help to address this regression.

Regressions:

 22%  build times windows2012-64 opt plain taskcluster-c4.4xlarge    2,199.75 -> 2,685.61
 18%  build times windows2012-64 opt plain taskcluster-c5.4xlarge    1,966.33 -> 2,317.40
 13%  build times windows2012-64 asan opt taskcluster-c4.4xlarge     2,831.88 -> 3,200.00
 13%  build times windows2012-64 opt taskcluster-c4.4xlarge          2,748.70 -> 3,099.56
 13%  build times windows2012-32 opt taskcluster-c4.4xlarge          2,705.11 -> 3,048.81
  5%  build times windows2012-64 debug plain taskcluster-c4.4xlarge  2,067.47 -> 2,162.76
  4%  build times windows2012-64-noopt debug taskcluster-c4.4xlarge  2,034.48 -> 2,115.01

You can find links to graphs and comparison views for each of the above tests at:
https://treeherder.mozilla.org/perf.html#/alerts?id=17471

On the page above you can see an alert for each affected platform as well as a link to a graph showing the history of scores for this test. There is also a link to a treeherder page showing the jobs in a pushlog format.

To learn more about the regressing test(s), please see:
https://developer.mozilla.org/en-US/docs/Mozilla/Performance/Automated_Performance_Testing_and_Sheriffing/Build_Metrics
Component: General → Javascript: Web Assembly
Product: Testing → Core
Flags: needinfo?(sunfish)
It's not immediately obvious to me what this push might have done that would cause this kind of regression. We added more code, but not an inordinate amount. I'll work on getting a machine to reproduce this on, but in the meantime, if anyone has any ideas about what might have happened here, I'd much appreciate it!
OS: Windows → All
It looks like Linux got affected as well.
The patch included a change to one of the build scripts which makes it print out how much time it took, and the path of its output directory. These are non-deterministic, and it's my current theory that this is preventing the build script's output from being cached. I created https://github.com/CraneStation/cranelift/pull/604 to remove this output. Once that lands upstream I'll submit a patch here for mozilla-inbound.
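(For illustration only: a minimal Python sketch of the kind of non-deterministic build-script output described above. The function and file names here are hypothetical, not the actual Cranelift script; the point is that an elapsed-time figure and an absolute output path differ on every run, so anything that hashes or caches the script's output sees a new result each time.)

    import os
    import time

    def emit_sources(out_dir):
        # Placeholder for the real source-generation step.
        with open(os.path.join(out_dir, "generated.rs"), "w") as f:
            f.write("// generated\n")

    def generate(out_dir):
        start = time.time()
        emit_sources(out_dir)
        # Non-deterministic output: the elapsed time and the absolute
        # output path change between runs and between build machines.
        print("Generated sources in %.2fs" % (time.time() - start))
        print("Output directory: %s" % os.path.abspath(out_dir))

The upstream pull request removes this output entirely.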
Depends on: 1507819
Is this addressed by the Cranelift bump in bug 1507819, which has landed?
Flags: needinfo?(igoldan)
(In reply to Benjamin Bouvier [:bbouvier] from comment #4)
> Is this addressed by the Cranelift bump in bug 1507819, which has landed?

No, I didn't see any kind of improvement.
Flags: needinfo?(igoldan)
Assignee: nobody → sunfish
I don't have any other ideas currently. If we need a fix promptly, we should disable Cranelift until we can figure out what's going on.
(In reply to Dan Gohman [:sunfish] from comment #6)
> I don't have any other ideas currently. If we need a fix promptly, we should
> disable Cranelift until we can figure out what's going on.

:froydnj, are you the person to decide what to do next regarding this issue?
Flags: needinfo?(nfroyd)
A few notes:

- Cranelift is enabled only on Nightlies, so this build time regression will not propagate to other channels.

- We could (and maybe should) disable Cranelift compilation on platforms that aren't x64, because it can't generate code for these platforms anyway. At the same time, the integration into the build system has been complicated enough that I wouldn't want anybody to have to do it again, so it is nice to keep compiling on all the platforms, just to make sure we are not regressing the build. Cranelift doesn't have features to disable code generation on specific platforms yet, so maybe we should bump https://github.com/CraneStation/cranelift/issues/13. That being said, we would still have the build time regression on all the 64-bit architectures. That still sounds like the best course of action to me: implement issue #13, then disable the code generation features on other platforms, while keeping the Cranelift build enabled by default in Gecko.

- I don't know how to profile build times in general. Do we have tooling for this?
I can decide what we can do. I talked to bbouvier this morning, and one thing to look at is whether the recent cranelift update changed the amount of code we can sccache (sccache cacheable requests). Can you (Ionut) look at this? Otherwise, there have been some changes to how we compile Rust code recently that may have helped this issue. We should wait and see for now; I don't think we should back this code out or anything like that.
Flags: needinfo?(nfroyd)
(In reply to Nathan Froyd [:froydnj] from comment #9)
> I talked to bbouvier this morning, and one thing to look at is whether the
> recent cranelift update changed the amount of code we can sccache (sccache
> cacheable requests). Can you (Ionut) look at this?

sccache requests_not_cacheable showed almost no change during our build time regressions. I checked this on Linux and Windows 64-bit.

(In reply to Nathan Froyd [:froydnj] from comment #9)

Otherwise, there have been some changes to how we compile Rust code recently
that may have helped this issue. We should wait and see for now; I don't
think we should back this code out or anything like that.

Can we resume the investigation into this, given comment 10?

Flags: needinfo?(nfroyd)

We could try. Now that I think about it, it's quite peculiar that the regression only shows up on Windows, and not any of our other platforms.

Cranelift invokes Python somewhere during its build process (or maybe it used to?)...is it possible that we're invoking Python a lot, and we don't notice it on Unix because new process creation is so fast, but we do notice it on Windows?

Flags: needinfo?(nfroyd) → needinfo?(bbouvier)

If so, then one possible approach would be to check the generated files into the tree, rather than generating them at build time. That would avoid the need to run any python for typical builds. That'd require developers working on those files to manually run the scripts when making changes, but we might be ok with that if it fixes our build-time regression.
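(A rough sketch of how such a check-in workflow could look, assuming a hypothetical generator entry point and file path; this is not something Cranelift currently ships. Developers would run the script manually after editing the meta sources, and a --check mode would let automation verify that the checked-in files are still current without regenerating anything at build time.)

    import argparse
    import sys

    def generate_sources():
        # Placeholder for the real generation step; returns a mapping of
        # checked-in file path -> expected contents.
        return {"lib/codegen/src/generated_example.rs": "// generated\n"}

    def main():
        parser = argparse.ArgumentParser()
        parser.add_argument("--check", action="store_true",
                            help="verify checked-in files instead of rewriting them")
        args = parser.parse_args()
        for path, contents in generate_sources().items():
            if args.check:
                with open(path) as f:
                    if f.read() != contents:
                        sys.exit("%s is out of date; rerun the generator" % path)
            else:
                with open(path, "w") as f:
                    f.write(contents)

    if __name__ == "__main__":
        main()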

OK, so it looks like we only invoke Python once:

https://github.com/CraneStation/cranelift/blob/master/lib/codegen/build.rs#L65

so the multiple invocations of Python theory seems to not hold water.

I would say that the update_file function doesn't check the existing contents before writing, so it updates mtimes even when the generated output hasn't changed:

https://github.com/CraneStation/cranelift/blob/master/lib/codegen/meta-python/srcgen.py#L86-L91

But I would think that cargo ought to not run the build script in the first place, so that...shouldn't matter?
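(For reference, a generic content-comparing write along these lines would leave mtimes alone whenever the generated output is unchanged; this is a sketch of the idea, not the actual srcgen.py code.)

    import os

    def update_file(path, contents):
        # Only rewrite the file if its contents actually changed, so the
        # mtime stays put and nothing downstream sees a spurious update.
        if os.path.exists(path):
            with open(path) as f:
                if f.read() == contents:
                    return
        with open(path, "w") as f:
            f.write(contents)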

It seems strange to me that we're telling cargo to rerun-if-changed even for directory changes:

https://github.com/CraneStation/cranelift/blob/master/lib/codegen/meta-python/gen_build_deps.py#L25-L43

Maybe the sourcedirs are getting updated (mtimes?) on Windows (but not Unix-y systems), but not the actual files, and therefore Cargo thinks that everything needs to be recompiled, and that's what's triggering constant rebuilds of cranelift?
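(If directory mtimes are indeed what trips up Cargo on Windows, one option is to emit rerun-if-changed lines for individual files instead of whole directories. A sketch of what such a helper could print, not the current gen_build_deps.py code:)

    import os

    def emit_rerun_if_changed(source_dirs):
        # Point Cargo at individual files rather than directories, so a
        # bare directory-mtime bump doesn't invalidate the build script's
        # cached output.
        for source_dir in source_dirs:
            for root, _dirs, files in os.walk(source_dir):
                for name in files:
                    if name.endswith(".py"):
                        print("cargo:rerun-if-changed=%s" % os.path.join(root, name))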

Any updates here?

Flags: needinfo?(nfroyd)

Cranelift invokes Python somewhere during its build process (or maybe it used to?)...is it possible that we're invoking Python a lot, and we don't notice it on Unix because new process creation is so fast, but we do notice it on Windows?

Python is called only once, as you found out, so this can't be it. Maybe the sources being regenerated over time would prevent cache hits?

Is there an explanation of how Talos measures build time? Do we have any tool to measure the build time of a subcomponent or subdirectory, or anything really that isn't measuring the time it takes to build the whole tree?

(We could do an experiment in Cranelift, where the sources aren't dynamically generated (the build system would pass this option when building Cranelift), and instead we'd copy them in tree, as suggested in comment 14. Then we can see if the build time regression disappears or not, and investigate more precisely, or eliminate this as a possible solution.)

(In reply to Ionuț Goldan [:igoldan], Performance Sheriffing from comment #16)

Any updates here?

To wit: this is a Nightly-only regression; Cranelift is not getting built at all on other releases.

Flags: needinfo?(bbouvier)

https://github.com/CraneStation/cranelift/pull/656 implemented the directory changes idea in comment 15. We'll see if that helps on the next cranelift import, I guess?

(In reply to Benjamin Bouvier [:bbouvier] from comment #17)

Python is called only once, as you found out, so this can't be it. Maybe the sources being regenerated over time would prevent cache hits?

Maybe? If the sources being regenerated caused mtime updates, I would expect build time regressions on all platforms, not just Windows.

Is there an explanation of how Talos measures build time? Do we have any tool to measure the build time of a subcomponent or subdirectory, or anything really that isn't measuring the time it takes to build the whole tree?

We do not, unfortunately.

(We could do an experiment in Cranelift, where the sources aren't dynamically generated (the build system would pass this option when building Cranelift), and instead we'd copy them in tree, as suggested in comment 14. Then we can see if the build time regression disappears or not, and investigate more precisely, or eliminate this as a possible solution.)

I think I'd attempt this via try first, to avoid polluting cranelift's history. The regression is big enough that doing this plus several try builds ought to be enough to tell you whether things got fixed or not.

Flags: needinfo?(nfroyd)

What's the try command to get build time measurements, please? I'd like to try a few things.

(In reply to Benjamin Bouvier [:bbouvier] from comment #19)

What's the try command to get build time measurements, please? I'd like to try a few things.

I believe it is:

    mach try fuzzy --no-artifact --query "build-win64-plain/opt | build-win64-asan/opt | build-win64/opt | build-win64/debug | build-win64-noopt/debug"

Did you get a chance to try to reproduce, Benjamin?

Flags: needinfo?(bbouvier)

No, I haven't yet. I think the first step would be to do a bisection to confirm which change in Gecko introduced this regression, then in Cranelift to see if there's an obvious commit that raised compilation time (it might be a death-by-a-thousand-commits scenario too).

Flags: needinfo?(bbouvier)

(In reply to Benjamin Bouvier [:bbouvier] from comment #22)

No, I haven't yet. I think the first step would be to do a bisection to confirm which change in Gecko introduced this regression, then in Cranelift to see if there's an obvious commit that raised compilation time (it might be a death-by-a-thousand-commits scenario too).

Ionuț, is this something you can do?

Flags: needinfo?(igoldan)

(I'm going to mark this fix-optional for 66 since a build time regression doesn't affect users)

(In reply to Andrew Overholt [:overholt] from comment #23)

(In reply to Benjamin Bouvier [:bbouvier] from comment #22)

No, I haven't yet. I think the first step would be to do a bisection to confirm which change in Gecko introduced this regression, then in Cranelift to see if there's an obvious commit that raised compilation time (it might be a death-by-a-thousand-commits scenario too).

Ionuț, is this something you can do?

I don't think bisecting on Cranelift is something I could work on.
As for looking over Gecko changes: I need more details to understand whether I could help here.

Flags: needinfo?(igoldan)
Keywords: stalled
Priority: -- → P3

(Note this should now be addressed by the landing of bug 1555894, if this was Cranelift that caused the long build times.)

Flags: needinfo?(sunfish) → needinfo?(igoldan)

(In reply to Benjamin Bouvier [:bbouvier] from comment #26)

(Note this should now be addressed by the landing of bug 1555894, if this was Cranelift that caused the long build times.)

Indeed, all regressions here seem fixed now.

Flags: needinfo?(igoldan)
Status: NEW → RESOLVED
Closed: 6 years ago
Resolution: --- → FIXED
Status: RESOLVED → VERIFIED

Since the bug is closed, the stalled keyword is now meaningless.
For more information, please visit auto_nag documentation.

Keywords: stalled
Target Milestone: --- → mozilla70