Closed Bug 1250797 Opened 8 years ago Closed 8 years ago

Investigate PGO build times with various configurations

Categories

(Firefox Build System :: General, defect)

defect
Not set
normal

Tracking

(Not tracked)

RESOLVED FIXED

People

(Reporter: gps, Assigned: gps)

References

Details

Attachments

(3 files)

One of my pieces of homework from the build system meet-up happening right now is to investigate PGO build times with various configurations. Things to test:

* Performance on my desktop (i7-6700K) vs automation
* Visual Studio 2013 vs 2015
* Impact of /PogoSafeMode (bug 1249923)

ted: what was that command line flag related to number of threads you mentioned earlier today?
Flags: needinfo?(ted)
First observation and it is a big one: my home desktop machine (i7-6700K - one of the fastest consumer grade CPUs you can currently buy) running Visual Studio 2015 is significantly faster than the fastest PGO build I found in automation.

http://archive.mozilla.org/pub/firefox/tinderbox-builds/mozilla-central-win32-pgo/1456025532/mozilla-central-win32-pgo-bm72-build1-build16.txt.gz is the fastest 32-bit PGO build job I found in terms of wall time. Relevant times from that log:

19:42:01 ( 00:00) - invoke `mach build`
19:44:58 ( 02:57) - configure/config.status finish
20:13:43 ( 31:42) - libxul PGO instrumentation begins
20:58:45 ( 76:44) - libxul PGO codegen finishes
...                 2nd compilation pass

On my machine (I haven't got it past the first PGO build pass yet):

00:00 - invoke `mach build`
01:39 - configure/config.status finish
15:48 - libxul PGO instrumentation begins
28:27 - libxul PGO codegen finishes

My machine is ~2.7x faster than automation for roughly the same work. Although, it was only ~2x faster before PGO profiling started. I am running VS2015 Update 1 (vs VS2013 in automation). PGO profiling clearly saturates only 1 physical CPU core.

My machine is 4.0GHz/core. I'm not sure what the machine in automation is running. But Xeons that typically run in data centers (like AWS) are typically not clocked that high. Typically hosting providers go for dual, quad, hex, octa, etc processor configurations and clock speed really drops off as you add more cores - see https://en.wikipedia.org/wiki/List_of_Intel_Xeon_microprocessors). Only the new Skylake-DT processors can reach 4.0 GHz and would be cycle-for-cycle comparable to my i7-6700K. I have no clue if Amazon runs those yet or not. I wouldn't bet money on it.
Shows a log from the first pass of a PGO build on my i7-6700K.
I should also mention I did a poor job of sanitizing my machine before measuring. I was actively browsing, etc during the build. I also noticed that Windows Defender was sucking a bunch of CPU during the build. How much, I'm not entirely sure. I wouldn't be surprised if my local build could be 5-10% faster if my computer were otherwise idle and I disabled Windows Defender.
Also, it would be good if you could compare VS2013, since we think VS2015 might be faster in general, so you'd be comparing apples to apples.
Here's a build log of a full PGO 32-bit build from my i7-6700K using VS2013 Update 4.

This is roughly an apples to apples comparison with automation. However, it doesn't do symbol generation, packaging, installer generation, etc, so wall to wall times are not comparable. You can however take the subset of times (the PGO build itself) as apples to apples.

00:00 - `mach build`
01:44 - configure/config.status complete
17:55 - libxul PGO instrumentation linking begins
40:59 - libxul PGO instrumentation ends
43:32 - pgo-profile-run make target begins
46:57 - pgo-profile-run finishes
47:00 - 2nd build pass begins
49:39 - libxul PGO update pass begins
87:25 - libxul PGO update pass finishes
88:31 - build finishes

I need to go back to the automation log to collect times after the libxul PGO instrumentation run. But my machine finished the initial libxul PGO pass at 40:59 versus 76:44 in automation, so it is still significantly faster. VS2015 was 28:27 vs 40:59 on the same machine!!
System resource utilization during 32-bit PGO build on VS2013.

Drop this file into objdir/.mozbuild/build_resources.json and load it with `mach resource-usage`. Or look at the raw data.

Sadly, the section capture from the first pass is lost because we're overwriting e.g. "export" and "compile" with last write wins from the 2nd pass. We should fix that. And we should teach it to explicitly capture PGO phases.
catlee, arr: you may find the preliminary figures in this bug interesting.
bug 1250971 covers adding cgthreads:N to the linker commandline.
Bruce Dawson (from Google) linked me to the Microsoft bug he filed which was fixed in 2015 and is the result of most of the speedup there:
https://connect.microsoft.com/VisualStudio/feedback/details/1064219/ltcg-linking-of-chromes-pdf-dll-spends-60-of-time-in-c2-dll-ssrfree
Bug 1249923 comment 9 contains measurements of the impact of /PogoSafeMode. The MSDN docs say it should make things slower. Yet it actually appeared to speed up PGO linking by a handful of minutes on my machine. Weird.
Using the patch in bug 1251313 and the workaround for the startup crash in bug 1244006, I was able to perform a full PGO build on VS 2015 Update 1!

AFAICT, 2015 is full of wins with no obvious downsides so far!

Using /CGTHREADS:8 (best of 2 runs, which were very similar):

                 VS 2013 /PogoSafeMode   VS 2015     VS 2015 /PogoSafeMode
libxul 1 start         15:36              13:45             13:44
libxul 1 finish        39:16 (23:40)      26:17 (12:32)     26:06 (12:22)
profileserver start    41:40              28:34             28:26
profileserver finish   45:19              31:48             31:53
libxul 2 start         48:01              33:11             33:15
libxul 2 finish        86:22 (38:21)      46:56 (13:45)     47:16 (14:01)
complete               87:24              47:54             48:13

VS 2013 vs 2015 with /CGTHREADS:8 /PogoSafeMode (which will soon be our default configuration):

                        2013    2015   delta
libxul 1 start         15:36   13:44   -1:58
libxul 1 finish        39:16   26:06  -13:08
profileserver start    41:40   28:26  -13:14
profileserver finish   45:19   31:53  -13:26
libxul 2 start         48:01   33:15  -14:46
libxul 2 finish        86:22   47:16  -39:06
complete               87:24   48:13  -39:11

                 2013   2015   delta  percent
libxul pass 1:  23:40  12:22  -11:18    52.3%
libxul pass 2:  38:21  14:01  -24:20    36.5%

Compilation is faster. PGO instrumentation is faster. PGO update is faster. Everything is faster!

What's really going to blow your mind is that my machine is able to perform a PGO build on VS 2015 faster than automation is able to perform an incremental non-PGO build on VS 2013. (But my machine is a 4.0 GHz Skylake and you aren't going to find that on a Xeon chip in a cloud services provider because they don't make Xeons that fast.)

Given the importance of PGO build turnaround time (it is the slowest build and prevents tests from running sooner and therefore slows down chemspills and mozilla-central integration), unless VS 2015 Talos results show a regression, I think upgrading our automation to VS 2015 should be a high priority.
Blocks: vs2015
(In reply to Gregory Szorc [:gps] from comment #12)

> Given the importance of PGO build turnaround time (it is the slowest build
> and prevents tests from running sooner and therefore slows down chemspills
> and mozilla-central integration), unless VS 2015 Talos results show a
> regression, I think upgrading our automation to VS 2015 should be a high
> priority.

Assuming all's well, does that mean we could just turn off opt builds in favor of pgo ones? That would be a major boon in terms of getting accurate and non-redundant talos results.
(In reply to William Lachance (:wlach) from comment #13)
> Assuming all's well, does that mean we could just turn off opt builds in
> favor of pgo ones? That would be a major boon in terms of getting accurate
> and non-redundant talos results.

In the ideal world, we'd likely do PGO builds for all builds (or at least builds that touch binaries). I'm not sure if we'd turn off non-PGO builds. I could see that going both ways.
(In reply to Gregory Szorc [:gps] from comment #14)
> (In reply to William Lachance (:wlach) from comment #13)
> > Assuming all's well, does that mean we could just turn off opt builds in
> > favor of pgo ones? That would be a major boon in terms of getting accurate
> > and non-redundant talos results.
> 
> In the ideal world, we'd likely do PGO builds for all builds (or at least
> builds that touch binaries). I'm not sure if we'd turn off non-PGO builds. I
> could see that going both ways.

Cool, as long as we have a pgo build for every revision with a binary change, that would be enough to make Talos pgo-only (at least on Windows).
Non-PGO builds on 2013 and 2015 appear to take roughly the same amount of wall time (20:50).

On preliminary inspections it looks like there might be a lull towards the end of the build before linking kicks in. The lull on 2015 /might/ be longer than 2013. I'll have to dig into the logs in more detail and/or add more detail in the logs.

So one less regression to worry about with switching to 2015.
So, good news is VS 2015 Update 2 CTP appears to be faster than VS 2015 Update 1: the first PGO pass for libxul is 7:21 compared to 12:22!

The bad news is we have another startup crash in the following from libstagefright:

void VectorImpl::_do_splat(void* dest, const void* item, size_t num) const {
    do_splat(dest, item, num);
}

The Visual Studio debugger is telling me that the crash is in pgort140.dll, which isn't very comforting. I can't even get symbols for the 4 frames after xul.dll.

I'm not the C++ debugging guru. Perhaps someone could install VS 2015 Update 2 CTP and help file bugs...
Depends on: 1253057
Good news: VS 2015 Update 2 RC came out today and it is able to perform a PGO build without crashing (unlike the CTP).

The bad news is the PGO pass speedups from the CTP are gone: Update 2 RC appears to build within statistical variance of Update 1. I can only hope that the speedups from CTP will show up again in Update 3. I guess time will tell.
I was able to get VS2015u1 PGO running in automation last night. 3 each builds of 32-bit and 64-bit completed. The fastest 32-bit build (http://archive.mozilla.org/pub/firefox/try-builds/gszorc@mozilla.com-95a24d38f004f2baa1146a4dd626ecabcf48a3c0/try-win32/try-win32-bm78-try1-build3976.txt.gz) *job* took 135 minutes. Here are the relevant times since build start from the compilation:

libxul 1 start   27:34
libxul 1 end     49:53 (22:19)
libxul 1 start   65:15
libxul 2 end    108:34 (43:19)

Let's compare that to the fastest PGO job I could find on central (http://archive.mozilla.org/pub/firefox/tinderbox-builds/mozilla-central-win32-pgo/1457440245/mozilla-central-win32-pgo-bm73-build1-build1.txt.gz):

libxul 1 start   32:24
libxul 1 end     80:00 (47:36)
libxul 2 start  103:26
libxul 2 end    172:38 (69:12)

Pass 1 was >2x faster. Pass 2 was <2x faster. (I had a 3x speedup on pass 2 on my home machine - weird.)

To summarize the PGO build times through libxul 2nd pass:

                  VS2013        VS2015      | VS2015 (i7-6700K @ home)
libxul 1 start     32:24     27:34 (- 4:50) |        13:44 (-13:50)
libxul 1 end       80:00     49:53 (-30:07) |        26:06 (-23:47)
libxul 2 start    103:26     65:15 (-38:11) |        33:15 (-32:00)
libxul 2 end      172:38    108:34 (-64:04) |        48:13 (-60:21)

Important takeaways:

* VS2015u1 is >1hr faster than VS2013 doing PGO builds in automation!
* VS2015u1 on my i7-6700K is >1hr faster than VS2015u1 running in automation
I'm not going to post the numbers, but it looks like 64-bit builds are at least 1 hr faster in automation. Possibly on the order of 90-100 minutes faster!
I'm going to say this bug has served its purpose and there is no major work to do here except finish up the transition to the VS 2015 in automation and see what the actual numbers tell us.
Status: ASSIGNED → RESOLVED
Closed: 8 years ago
Resolution: --- → FIXED
Product: Core → Firefox Build System
You need to log in before you can comment on or make changes to this bug.

Attachment

General

Created:
Updated:
Size: