stylo: 3.67 - 51.66% build times / installer size (windows2012-32, windows2012-64, windows8-64) regression from enabling stylo (push 06ec956f753346236bf89f2bead2066d22778fc2, Wed Jul 5 2017)

RESOLVED WONTFIX

Status

()

defect
P3
normal
RESOLVED WONTFIX
2 years ago
2 years ago

People

(Reporter: igoldan, Unassigned)

Tracking

(Depends on 1 bug, {regression})

Trunk
All
Windows
Points:
---
Dependency tree / graph

Firefox Tracking Flags

(Not tracked)

Details

We have detected a build metrics regression from push:

https://hg.mozilla.org/integration/autoland/pushloghtml?changeset=06ec956f753346236bf89f2bead2066d22778fc2

As author of one of the patches included in that push, we need your help to address this regression.

Regressions:

 52%  build times summary windows8-64 opt buildbot-c3.4xlarge     2,374.82 -> 3,601.68
 31%  build times summary windows2012-32 pgo taskcluster-c4.4xlarge4,401.73 -> 5,748.64
 26%  build times summary windows2012-64 pgo taskcluster-c4.4xlarge4,733.85 -> 5,969.88
  4%  installer size summary windows2012-32 pgo                   51,341,706.33 -> 53,415,840.75
  4%  installer size summary windows2012-64 pgo                   55,395,777.00 -> 57,429,896.75


You can find links to graphs and comparison views for each of the above tests at: https://treeherder.mozilla.org/perf.html#/alerts?id=7709

On the page above you can see an alert for each affected platform as well as a link to a graph showing the history of scores for this test. There is also a link to a treeherder page showing the jobs in a pushlog format.

To learn more about the regressing test(s), please see: https://developer.mozilla.org/en-US/docs/Mozilla/Performance/Automated_Performance_Testing_and_Sheriffing/Build_Metrics
Component: Untriaged → CSS Parsing and Computation
OS: Unspecified → Windows
Product: Firefox → Core
Hardware: Unspecified → All
Version: unspecified → Trunk
:froydnj Please look over these Windows build issues and provide feedback, by estimating a fix time for them.
Flags: needinfo?(nfroyd)
Summary: 3.67 - 51.66% build times / installer size (windows2012-32, windows2012-64, windows8-64) regression on push 06ec956f753346236bf89f2bead2066d22778fc2 (Wed Jul 5 2017) → stylo: 3.67 - 51.66% build times / installer size (windows2012-32, windows2012-64, windows8-64) regression from enabling stylo (push 06ec956f753346236bf89f2bead2066d22778fc2, Wed Jul 5 2017)
Bobby, do you know why building Stylo would cause a 25-50% build time regression on Windows? Is this a problem with our rust build configuration on Windows?
Flags: needinfo?(bobbyholley)
Probably Xidorn also may have insight on this. From memory, I've heard linking the rust libraries takes forever on Windows... But not sure whether we can do much about it.
Flags: needinfo?(xidorn+moz)
The logs of 2012-64 pgo can be found:
* before stylo: https://public-artifacts.taskcluster.net/I7NEoh9BRiKx0-y0weWQcg/0/public/logs/live_backing.log
* after stylo: https://public-artifacts.taskcluster.net/WH7FQLAgQumWWDmklEAuMA/0/public/logs/live_backing.log

I took a brief comparison on the first build of the pgo builds. It seems that the configuration time and link time are roughly the same in the two log (1min for configuration, 21min for linking).

The issue for the post-stylo build is that, the cargo part takes way too long.

In the pre-stylo log, cargo task takes ~8.5min, in which it takes ~2min to build gkrust crate. The whole build time is ~16min.

However, in the post-stylo log, cargo task takes ~36min, in which it takes ~8min to build gkrust crate. The whole build time extends to ~38.5min, and the cargo task is the task which blocks the final linking.

Bad news is that we start cargo task early enough (~2.5min after build start), so the problem here is just that it takes too long for Rust code to build.

I suspect it is mainly because we spawn too many processes to build. IIUC, cargo isn't aware of the overall computation resource allocation, and it just spawns processes as if it is the only user of CPU, while mach considers cargo task to be a single process. If that's true, we basically spend too much time running almost twice as many processes as we actually want during the early stage of build, and when cargo enters its final, not-quite-parallel stage, we have finished building everything else, and thus the utilization is significantly lowered.
What I saw before was that linking the build script of style crate takes ~3min. This isn't something observed in this log.

From that log, building the build script takes ~0.5min, and running it takes ~0.3min, and after that, it takes 10min to build style crate before we get to geckoservo crate, and geckoservo crate itself takes another 3min to build.

So the Rust code being too slow to build is probably the root cause.
Flags: needinfo?(xidorn+moz)
See Also: → 1378791
(In reply to Xidorn Quan [:xidorn] UTC+10 from comment #4)
> I suspect it is mainly because we spawn too many processes to build. IIUC,
> cargo isn't aware of the overall computation resource allocation, and it
> just spawns processes as if it is the only user of CPU, while mach considers
> cargo task to be a single process. If that's true, we basically spend too
> much time running almost twice as many processes as we actually want during
> the early stage of build, and when cargo enters its final,
> not-quite-parallel stage, we have finished building everything else, and
> thus the utilization is significantly lowered.

Xidorn, can we limit the number of processes cargo uses on Windows or teach mach that cargo spawns multiple processes? I know Windows processes are much more heavyweight than Linux processes, but why isn't this a problem on Linux?
Flags: needinfo?(bobbyholley) → needinfo?(xidorn+moz)
(In reply to Chris Peterson [:cpeterson] from comment #6)
> Xidorn, can we limit the number of processes cargo uses on Windows or teach
> mach that cargo spawns multiple processes? I know Windows processes are much
> more heavyweight than Linux processes, but why isn't this a problem on Linux?

That is bug 1367940.
Flags: needinfo?(xidorn+moz)
(In reply to Ionuț Goldan [:igoldan], Performance Sheriffing from comment #1)
> :froydnj Please look over these Windows build issues and provide feedback,
> by estimating a fix time for them.

The installer size changes are expected, even if they may be a little large; we're simply shipping more code here, no way around that.

I don't have a firm fix time for the build time regressions.  We'll investigate making some Rust changes in bug 1367940 and maybe some other places, but there's no quick fix.

I'd be curious if sccache helps at all here--presumably these first couple Windows compiles are having to recompile all new code, while later runs would be able to at least hit in the cache.  The graphs look like the build time is pretty consistent, so maybe it's just general overhead...
Flags: needinfo?(nfroyd)
Depends on: 1367940
(In reply to Chris Peterson [:cpeterson] from comment #6)
> Xidorn, can we limit the number of processes cargo uses on Windows or teach
> mach that cargo spawns multiple processes? I know Windows processes are much
> more heavyweight than Linux processes, but why isn't this a problem on Linux?

We should not limit the number of processes cargo uses. Instead, we should limit everything else. Cargo is the slowest part, so limiting it would simply make things worse.

I have no idea why this wasn't a problem for Linux. Maybe Linux's scheduling fits this workload pattern better, or the linker is much faster on Linux so building gkrust doesn't take that much time. It's also possible that sccache doesn't take effect for Rust on Windows somehow.
Also, this sccache regression was noticed. Please consider a fix for it too.

== Change summary for alert #7718 (as of July 06 2017 10:34 UTC) ==

Regressions:

 65%  sccache requests_not_cacheable summary windowsxp opt buildbot-c3.4xlarge     20.00 -> 33.00

For up to date results, see: https://treeherder.mozilla.org/perf.html#/alerts?id=7718
See Also: → 1353472
Depends on: 1380321
Depends on: 1380327
(In reply to Ionuț Goldan [:igoldan], Performance Sheriffing from comment #10)
> Also, this sccache regression was noticed. Please consider a fix for it too.
>  65%  sccache requests_not_cacheable summary windowsxp opt
> buildbot-c3.4xlarge     20.00 -> 33.00

This is expected, there are a number of rust compilations that are not cacheable by sccache.
Priority: -- → P3
given comment 11, I don't think there is specific work to do in this bug.  Please reopen if you feel there is work here.
Status: NEW → RESOLVED
Closed: 2 years ago
Resolution: --- → WONTFIX
Depends on: 1386371
You need to log in before you can comment on or make changes to this bug.