1542746 - Switch clang-based PGO builds from front-end based instrumentation to IR-level instrumentation.

Reporter

Description

•

6 years ago

Clang/LLVM supports two flavours of instrumentation-based PGO:

Front-end based instrumentation (-fprofile-instr-generate) and
IR-level instrumentation (-fprofile-generate).

The Rust compiler will only support the latter (at least for the foreseeable future) and the two flavours cannot be mixed within the same binary. So in order to enable PGO for both Rust and C code at the same time, C code needs to be switched to using -fprofile-generate.

Doing so on Linux seems to work fine, as indicated by these try-builds:

https://treeherder.mozilla.org/#/jobs?repo=try&revision=9767d518556df6dfe911353042e02eeeca8c10aa (Clang 7)
https://treeherder.mozilla.org/#/jobs?repo=try&revision=01d2b39ac644580e5e7c04343d4367ac0dc06721 (Clang 8, including PGO for Rust)

On Windows this might or might not be blocked on problems with instrumentation and exception handling (see https://bugzilla.mozilla.org/show_bug.cgi?id=1437452#c41)

Nathan Froyd [:froydnj]

Assignee

Updated

•

6 years ago

Assignee: nobody → nfroyd

Nathan Froyd [:froydnj]

Assignee

Comment 1

•

6 years ago

The initial performance results are not encouraging:

https://treeherder.mozilla.org/perf.html#/compare?originalProject=try&originalRevision=df56e4acc60147032e2738907324f088bcb9ae1a&newProject=try&newRevision=e8b717f52094c622d68ffa949b2c1210f515f169

Basically everything loses on windows (except a massive 15% improvement in raptor-webaudio), and there is the occasional improvement on linux.

I think what's happening is the the MOZ_PROFILE_ORDER_FILE stuff really helped startup times by grouping a bunch of code, and we lost that ability with the IR-level instrumentation. I don't think the two kinds of instrumentation are usable together. What we may have to do is what dmajor suggested doing when the MOZ_PROFILE_ORDER_FILE stuff first went into the tree: check in static versions of the ordering files to use with the linker, even when we're using IR-level instrumentation. That lets us get the fine-grained profiling information, and enable the linker to appropriately reorder things to improve startup time. We can refresh the order files periodically to account for new code being added to the tree.

With any luck, we might see small improvements on Windows with that change.

(Away)

Comment 2

•

6 years ago

Out of curiosity, what makes order files incompatible with IR-level PGO?

Nathan Froyd [:froydnj]

Assignee

Comment 3

•

6 years ago

(In reply to David Major [:dmajor] (low availability) from comment #2)

Out of curiosity, what makes order files incompatible with IR-level PGO?

I should have been more precise here: it's not that order files are incompatible with IR-level PGO, it's that the generation of order files requires frontend-level instrumentation, and we cannot use both frontend level instrumentation and IR level instrumentation simultaneously, AFAIK.

We can still use order files in the linking step when we use IR-level PGO, we just have to find some other way of generating them, and checking in static files seems like the easiest way forward there.

(Away)

Comment 4

•

6 years ago

Gotcha, thanks for the clarification.

I suppose in the worst case we could do separate build+training phases just to generate order files, but for that amount of trouble the gain would really have to be worth it! We'd have to weigh that against the trouble of updating the checked-in files (I'm guessing we could do that once at the end of each nightly cycle, to get something reasonably recent onto the departing beta train).

Nathan Froyd [:froydnj]

Assignee

Comment 5

•

6 years ago

(In reply to David Major [:dmajor] (low availability) from comment #4)

I suppose in the worst case we could do separate build+training phases just to generate order files, but for that amount of trouble the gain would really have to be worth it! We'd have to weigh that against the trouble of updating the checked-in files (I'm guessing we could do that once at the end of each nightly cycle, to get something reasonably recent onto the departing beta train).

The upside of that approach would be that with the multi-stage PGO work coming to Windows, you can imagine a taskgraph that looks like:

frontend-level build ----> ordering run ---+
                                           |
                                           |
                                           +---> profile-use build
                                           |
                                           |
IR-level build ----------> training run ---+

so that, in theory, getting accurate ordering information wouldn't really add latency to the build process.

I don't have a good sense for how much more accurate ordering information really matters. My suspicion is that the first 100k functions or whatever we measure don't tend to change all that much over time...

(Away)

Comment 6

•

6 years ago

I don't have a good sense for how much more accurate ordering information
really matters. My suspicion is that the first 100k functions or whatever
we measure don't tend to change all that much over time...

The identity of those functions (as we think of them) probably doesn't change much, but I wouldn't be surprised to see their formal spelling change due to MFBT refactorings, etc.

Nathan Froyd [:froydnj]

Assignee

Comment 7

•

6 years ago

New results, this time with an order file applied:

https://treeherder.mozilla.org/perf.html#/compare?originalProject=try&originalRevision=df56e4acc60147032e2738907324f088bcb9ae1a&newProject=try&newRevision=efa45c5bbf841a6212504a349177a3c7f1ded1df

Different tests regress, and a few tests regress slightly less, but overall the same as before.

Looking at the log, I see that we are passing in the correct order file, but the linker complains about...a lot of missing symbols. Which is peculiar, because I know that I grabbed the order files from the same base m-c revision as the try push compared above.

It doesn't look like I got the win32 and win64 order files mixed up (e.g. the win32 one has mangled names with __fastcall annotations that correspond to NS_FASTCALL methods and the win64 one does not). So I'm not sure what's going wrong there. Unless the IR-level optimizations indicate that inlining many more functions would be profitable, or something? The linker warns for ~8000 or so functions, and there's ~27000 functions in the order file, so maybe that's about right?

(Away)

Comment 8

•

6 years ago

I see >7000 missing symbol warnings on existing m-c builds, so I wouldn't worry about it.

Nathan Froyd [:froydnj]

Assignee

Comment 9

•

6 years ago

sccache isn't getting in the way, because we don't use sccache for shippable builds. Boo.

Stats on the shippable build says the binary is 8MB bigger (~8% regression), so maybe that's partly responsible for things like e.g. main_startup_fileio regressions. If we are getting hit hard on startup, maybe that's also responsible for some of the other regressions as well?

Michael Woerister

Reporter

Comment 10

•

6 years ago

Something that might be interesting here: the -mllvm -disable-preinline flag that we talked about -- he same effect can be reached when optimizing for size:
https://github.com/llvm/llvm-project/blob/llvmorg-8.0.0/llvm/lib/Transforms/IPO/PassManagerBuilder.cpp#L278-L280

Nathan Froyd [:froydnj]

Assignee

Comment 11

•

6 years ago

Tests confirm that a dynamic vs. static order file doesn't make that much difference (with the current frontend-based instrumentation):

https://treeherder.mozilla.org/perf.html#/compare?originalProject=try&originalRevision=9c0e0308cf53ef82d29c0789fe4be6fb34faf62e&newProject=try&newRevision=ddd99afc517e7738222e1b9a6fa404657910e4be

I am starting to wonder if the IR-based implementation is even a) saving the data in the correct place and b) using the correct data. I suppose it must be using the data--presumably we wouldn't have seen -Wbackend-plugin errors--but maybe we're not correctly aggregating all the data we collected somehow? That isn't a great theory--I'd expect things to work basically the same between our clang-cl builds and our clang (PGO) builds, but maybe there's something related to Windows paths or similar.

Nathan Froyd [:froydnj]

Assignee

Comment 12

•

6 years ago

Ah, so, the profiling data that we're getting out of the IR-level instrumentation is ~600k in size, vs. 200MB for the front-end based instrumentation. And that's because we're dying with some sort of access violation when we're doing the first run to set up our profile. So what we're actually doing is trying to do profile-guided optimization without any sort of useful profiling data, which is bad.

Not sure yet where we're dying; going to try and look at that today.

Michael Shal [:mshal]

Comment 13

•

6 years ago

(In reply to Nathan Froyd [:froydnj] from comment #12)

Ah, so, the profiling data that we're getting out of the IR-level instrumentation is ~600k in size, vs. 200MB for the front-end based instrumentation. And that's because we're dying with some sort of access violation when we're doing the first run to set up our profile. So what we're actually doing is trying to do profile-guided optimization without any sort of useful profiling data, which is bad.

Not sure yet where we're dying; going to try and look at that today.

Sounds like bug 1548515, or is that something different?

Nathan Froyd [:froydnj]

Assignee

Comment 14

•

6 years ago

(In reply to Michael Shal [:mshal] from comment #13)

(In reply to Nathan Froyd [:froydnj] from comment #12)

Ah, so, the profiling data that we're getting out of the IR-level instrumentation is ~600k in size, vs. 200MB for the front-end based instrumentation. And that's because we're dying with some sort of access violation when we're doing the first run to set up our profile. So what we're actually doing is trying to do profile-guided optimization without any sort of useful profiling data, which is bad.

Not sure yet where we're dying; going to try and look at that today.

Sounds like bug 1548515, or is that something different?

dmajor agreed to look into this problem, and investigation seems to be pointing at a situation similar to https://bugzilla.mozilla.org/show_bug.cgi?id=1514965#c18 . We probably have to turn off pgo instrumentation for more files at the very minimum.

Mike Hommey [:glandium]

Comment 15

•

6 years ago

So... the profile-run tier doesn't fail when Firefox crashes? Didn't we make it fail when it was all in one job? (or was that still pending because we had too much failure?)

Michael Shal [:mshal]

Comment 16

•

6 years ago

(In reply to Mike Hommey [:glandium] from comment #15)

So... the profile-run tier doesn't fail when Firefox crashes? Didn't we make it fail when it was all in one job? (or was that still pending because we had too much failure?)

Yeah, bug 1252556 made the task fail if Firefox fails during PGO. That should still apply for the profile-run tier in 3-tier PGO - is there a treeherder link I can look at where that's not the case?

(Away)

Comment 17

•

6 years ago

(In reply to Michael Shal [:mshal] from comment #16)

Yeah, bug 1252556 made the task fail if Firefox fails during PGO. That
should still apply for the profile-run tier in 3-tier PGO - is there a
treeherder link I can look at where that's not the case?

Here's one: https://treeherder.mozilla.org/#/jobs?repo=try&revision=98bea3ecf05e88baad3abaf85c5bf96c69c5606e&selectedJob=245462576

Nathan Froyd [:froydnj]

Assignee

Updated

•

6 years ago

Depends on: 1550868

Michael Shal [:mshal]

Comment 18

•

6 years ago

(In reply to David Major [:dmajor] from comment #17)

(In reply to Michael Shal [:mshal] from comment #16)

Yeah, bug 1252556 made the task fail if Firefox fails during PGO. That
should still apply for the profile-run tier in 3-tier PGO - is there a
treeherder link I can look at where that's not the case?

Here's one: https://treeherder.mozilla.org/#/jobs?repo=try&revision=98bea3ecf05e88baad3abaf85c5bf96c69c5606e&selectedJob=245462576

So, this is interesting - apparently the fact that we call profileserver.py with an environment variable (setting JARLOG_FILE) causes the return code checking to fail in our version of make on Windows.

STR:

$ cat Makefile
profiledbuild:
@echo Hi
JARLOG_FILE=foo python fail.py
@echo SHOULD NOT SEE THIS

$ cat fail.py
import sys
rc = 3221225477
print("Returning rc=%i" % rc)
sys.exit(rc)

$ mozmake
Hi
JARLOG_FILE=foo python fail.py
Returning rc=3221225477
SHOULD NOT SEE THIS

If instead you remove the JARLOG_FILE=foo from the Makefile, so that it just calls 'python fail.py' as the second line, then you get the expected behavior:

$ mozmake
Hi
python fail.py
Returning rc=3221225477
mozmake: *** [Makefile:3: profiledbuild] Error -1

On my Linux machine with make-4.1, both cases get the expected behavior.

Maybe we should just change it to sys.exit(1) instead of sys.exit(ret)? Using 'rc = 1' in fail.py produces the expected behavior both with an without the environment variable, so it seems the specific value of the return code matters.

glandium, any insights here?

Flags: needinfo?(mh+mozilla)

Mike Hommey [:glandium]

Comment 19

•

6 years ago

This smells like a bug in make on Windows. I was able to reproduce, and to find a workaround: use env. as in env JARLOG_FILE=foo python ... instead of JARLOG_FILE=foo python ....

Flags: needinfo?(mh+mozilla)

Mike Hommey [:glandium]

Comment 20

•

6 years ago

FWIW, it seems the bug is fixed in GNU make git master.

Nathan Froyd [:froydnj]

Assignee

Updated

•

6 years ago

Depends on: 1551698

Nathan Froyd [:froydnj]

Assignee

Comment 21

•

6 years ago

Attached file Bug 1542746 - use a static MOZ_PROFILE_ORDER_FILE; r=#build — Details

We're planning on switching to IR-based profiling, so we can't use the
frontend-based instrumentation to collect the order in which functions
are executed...at least not during the build itself. Performance tests
indicate that not having the order information decreases performance
significantly. So we're going to check in static files for Win32 and
Win64 and use those to perform the ordering. It's OK if these files are
slightly out of date; as of this writing, builds that generate and then
use these files complain that ~1/3 of the functions can't be found (!).
We're just trying to do something slightly smarter than whatever the
linker default is.

Nathan Froyd [:froydnj]

Assignee

Comment 22

•

6 years ago

Attached file Bug 1542746 - switch clang{,-cl} to use IR-based instrumentation for PGO; r=#build — Details

This form of instrumentation is more like our other platforms, and also
opens the possibility of interacting properly with Rust IR-level PGO.

Depends on D31132

Nathan Froyd [:froydnj]

Assignee

Comment 23

•

6 years ago

Attached file Bug 1542746 - remove code for frontend-based PGO instrumentation; r=#build — Details

We're moving to IR-level PGO instrumentation for clang-cl. We've also
moved to using static linker ordering files, which was the primary
application of the previous style of PGO instrumentation. We therefore
we no longer need this code.

Depends on D31133

Michael Shal [:mshal]

Comment 24

•

6 years ago

(In reply to Mike Hommey [:glandium] from comment #20)

FWIW, it seems the bug is fixed in GNU make git master.

Is it worth backporting the fix into mozmake? It seems like the kind of thing that could bite us elsewhere if we just use the 'env' fix for PGO. (During a quick search, I see another instance of setting an environment variable this way in the packager: https://dxr.mozilla.org/mozilla-central/rev/5f95b3f2ea44723ba6a8c41a4b27c88032df709f/toolkit/mozapps/installer/packager.mk#23, but maybe that isn't able to fail with a bad error code)

:Gijs (he/him)

Comment 25

•

6 years ago

Note: I pushed to try with this with some other stuff, and it looks like taskcluster/scripts/misc/run-profileserver.sh needs updating - it assumes we have a single default.profraw file at https://searchfox.org/mozilla-central/source/taskcluster/scripts/misc/run-profileserver.sh#36 .

:Gijs (he/him)

Updated

•

6 years ago

Blocks: 1196094

Nathan Froyd [:froydnj]

Assignee

Updated

•

6 years ago

Depends on: 1553823

Nathan Froyd [:froydnj]

Assignee

Updated

•

6 years ago

Depends on: 1553972

Nathan Froyd [:froydnj]

Assignee

Updated

•

6 years ago

Depends on: 1553973

Pulsebot

Comment 26

•

6 years ago

Pushed by nfroyd@mozilla.com: https://hg.mozilla.org/integration/autoland/rev/13461d98a55f use a static MOZ_PROFILE_ORDER_FILE; r=nalexander,chmanchester https://hg.mozilla.org/integration/autoland/rev/bd7bac95cd92 switch clang{,-cl} to use IR-based instrumentation for PGO; r=firefox-build-system-reviewers,chmanchester https://hg.mozilla.org/integration/autoland/rev/085342ba416f remove code for frontend-based PGO instrumentation; r=dmajor

Andrei Ciure[:aciure]

Comment 27

•

6 years ago

bugherder

https://hg.mozilla.org/mozilla-central/rev/13461d98a55f
https://hg.mozilla.org/mozilla-central/rev/bd7bac95cd92
https://hg.mozilla.org/mozilla-central/rev/085342ba416f

Status: NEW → RESOLVED

Closed: 6 years ago

status-firefox69: --- → fixed

Resolution: --- → FIXED

Target Milestone: --- → mozilla69

Ionuț Goldan [:igoldan]

Comment 28

•

6 years ago

Some windows build times have improved:

== Change summary for alert #21085 (as of Sat, 25 May 2019 00:48:46 GMT) ==

Improvements:

5% build times windows2012-32-shippable opt nightly taskcluster-c4.4xlarge 5,956.90 -> 5,648.71
4% build times windows2012-64-shippable opt nightly taskcluster-c4.4xlarge 6,398.66 -> 6,114.51

For up to date results, see: https://treeherder.mozilla.org/perf.html#/alerts?id=21085

Florin Strugariu [:Bebe]

Comment 29

•

6 years ago

== Change summary for alert #21091 (as of Sat, 25 May 2019 16:44:23 GMT) ==

Improvements:

5% raptor-webaudio-firefox linux64-shippable-qr opt 159.50 -> 150.92
4% raptor-webaudio-firefox linux64-shippable opt 152.67 -> 145.83
3% raptor-tp6-instagram-firefox loadtime linux64-shippable opt 333.67 -> 322.29
3% raptor-speedometer-firefox linux64-shippable-qr opt 88.85 -> 91.26
2% raptor-tp6-sheets-firefox loadtime linux64-shippable-qr opt 709.96 -> 695.38

For up to date results, see: https://treeherder.mozilla.org/perf.html#/alerts?id=21091

Ionuț Goldan [:igoldan]

Updated

•

6 years ago

Regressions: 1555298

Ryan VanderMeulen [:RyanVM]

Updated

•

6 years ago

Regressions: 1562612

:Gijs (he/him)

Updated

•

6 years ago

Regressions: 1590427

Masatoshi Kimura [:emk]

Updated

•

6 years ago

Depends on: 1606191

Bug 1542746 - use a static MOZ_PROFILE_ORDER_FILE; r=#build 6 years ago Nathan Froyd [:froydnj] 47 bytes, text/x-phabricator-request		Details \| Review
Bug 1542746 - switch clang{,-cl} to use IR-based instrumentation for PGO; r=#build 6 years ago Nathan Froyd [:froydnj] 47 bytes, text/x-phabricator-request		Details \| Review
Bug 1542746 - remove code for frontend-based PGO instrumentation; r=#build 6 years ago Nathan Froyd [:froydnj] 47 bytes, text/x-phabricator-request		Details \| Review