Closed Bug 1542746 Opened 9 months ago Closed 8 months ago

Switch clang-based PGO builds from front-end based instrumentation to IR-level instrumentation.

Categories

(Firefox Build System :: General, defect)

defect
Not set

Tracking

(firefox69 fixed)

RESOLVED FIXED
mozilla69
Tracking Status
firefox69 --- fixed

People

(Reporter: mwoerister, Assigned: froydnj)

References

(Depends on 2 open bugs, Regressed 1 open bug)

Details

Attachments

(3 files)

Clang/LLVM supports two flavours of instrumentation-based PGO:

  • Front-end based instrumentation (-fprofile-instr-generate) and
  • IR-level instrumentation (-fprofile-generate).

The Rust compiler will only support the latter (at least for the foreseeable future) and the two flavours cannot be mixed within the same binary. So in order to enable PGO for both Rust and C code at the same time, C code needs to be switched to using -fprofile-generate.

Doing so on Linux seems to work fine, as indicated by these try-builds:

On Windows this might or might not be blocked on problems with instrumentation and exception handling (see https://bugzilla.mozilla.org/show_bug.cgi?id=1437452#c41)

Assignee: nobody → nfroyd

The initial performance results are not encouraging:

https://treeherder.mozilla.org/perf.html#/compare?originalProject=try&originalRevision=df56e4acc60147032e2738907324f088bcb9ae1a&newProject=try&newRevision=e8b717f52094c622d68ffa949b2c1210f515f169

Basically everything loses on windows (except a massive 15% improvement in raptor-webaudio), and there is the occasional improvement on linux.

I think what's happening is the the MOZ_PROFILE_ORDER_FILE stuff really helped startup times by grouping a bunch of code, and we lost that ability with the IR-level instrumentation. I don't think the two kinds of instrumentation are usable together. What we may have to do is what dmajor suggested doing when the MOZ_PROFILE_ORDER_FILE stuff first went into the tree: check in static versions of the ordering files to use with the linker, even when we're using IR-level instrumentation. That lets us get the fine-grained profiling information, and enable the linker to appropriately reorder things to improve startup time. We can refresh the order files periodically to account for new code being added to the tree.

With any luck, we might see small improvements on Windows with that change.

Out of curiosity, what makes order files incompatible with IR-level PGO?

(In reply to David Major [:dmajor] (low availability) from comment #2)

Out of curiosity, what makes order files incompatible with IR-level PGO?

I should have been more precise here: it's not that order files are incompatible with IR-level PGO, it's that the generation of order files requires frontend-level instrumentation, and we cannot use both frontend level instrumentation and IR level instrumentation simultaneously, AFAIK.

We can still use order files in the linking step when we use IR-level PGO, we just have to find some other way of generating them, and checking in static files seems like the easiest way forward there.

Gotcha, thanks for the clarification.

I suppose in the worst case we could do separate build+training phases just to generate order files, but for that amount of trouble the gain would really have to be worth it! We'd have to weigh that against the trouble of updating the checked-in files (I'm guessing we could do that once at the end of each nightly cycle, to get something reasonably recent onto the departing beta train).

(In reply to David Major [:dmajor] (low availability) from comment #4)

I suppose in the worst case we could do separate build+training phases just to generate order files, but for that amount of trouble the gain would really have to be worth it! We'd have to weigh that against the trouble of updating the checked-in files (I'm guessing we could do that once at the end of each nightly cycle, to get something reasonably recent onto the departing beta train).

The upside of that approach would be that with the multi-stage PGO work coming to Windows, you can imagine a taskgraph that looks like:

frontend-level build ----> ordering run ---+
                                           |
                                           |
                                           +---> profile-use build
                                           |
                                           |
IR-level build ----------> training run ---+

so that, in theory, getting accurate ordering information wouldn't really add latency to the build process.

I don't have a good sense for how much more accurate ordering information really matters. My suspicion is that the first 100k functions or whatever we measure don't tend to change all that much over time...

I don't have a good sense for how much more accurate ordering information
really matters. My suspicion is that the first 100k functions or whatever
we measure don't tend to change all that much over time...

The identity of those functions (as we think of them) probably doesn't change much, but I wouldn't be surprised to see their formal spelling change due to MFBT refactorings, etc.

New results, this time with an order file applied:

https://treeherder.mozilla.org/perf.html#/compare?originalProject=try&originalRevision=df56e4acc60147032e2738907324f088bcb9ae1a&newProject=try&newRevision=efa45c5bbf841a6212504a349177a3c7f1ded1df

Different tests regress, and a few tests regress slightly less, but overall the same as before.

Looking at the log, I see that we are passing in the correct order file, but the linker complains about...a lot of missing symbols. Which is peculiar, because I know that I grabbed the order files from the same base m-c revision as the try push compared above.

It doesn't look like I got the win32 and win64 order files mixed up (e.g. the win32 one has mangled names with __fastcall annotations that correspond to NS_FASTCALL methods and the win64 one does not). So I'm not sure what's going wrong there. Unless the IR-level optimizations indicate that inlining many more functions would be profitable, or something? The linker warns for ~8000 or so functions, and there's ~27000 functions in the order file, so maybe that's about right?

I see >7000 missing symbol warnings on existing m-c builds, so I wouldn't worry about it.

sccache isn't getting in the way, because we don't use sccache for shippable builds. Boo.

Stats on the shippable build says the binary is 8MB bigger (~8% regression), so maybe that's partly responsible for things like e.g. main_startup_fileio regressions. If we are getting hit hard on startup, maybe that's also responsible for some of the other regressions as well?

Something that might be interesting here: the -mllvm -disable-preinline flag that we talked about -- he same effect can be reached when optimizing for size:
https://github.com/llvm/llvm-project/blob/llvmorg-8.0.0/llvm/lib/Transforms/IPO/PassManagerBuilder.cpp#L278-L280

Tests confirm that a dynamic vs. static order file doesn't make that much difference (with the current frontend-based instrumentation):

https://treeherder.mozilla.org/perf.html#/compare?originalProject=try&originalRevision=9c0e0308cf53ef82d29c0789fe4be6fb34faf62e&newProject=try&newRevision=ddd99afc517e7738222e1b9a6fa404657910e4be

I am starting to wonder if the IR-based implementation is even a) saving the data in the correct place and b) using the correct data. I suppose it must be using the data--presumably we wouldn't have seen -Wbackend-plugin errors--but maybe we're not correctly aggregating all the data we collected somehow? That isn't a great theory--I'd expect things to work basically the same between our clang-cl builds and our clang (PGO) builds, but maybe there's something related to Windows paths or similar.

Ah, so, the profiling data that we're getting out of the IR-level instrumentation is ~600k in size, vs. 200MB for the front-end based instrumentation. And that's because we're dying with some sort of access violation when we're doing the first run to set up our profile. So what we're actually doing is trying to do profile-guided optimization without any sort of useful profiling data, which is bad.

Not sure yet where we're dying; going to try and look at that today.

(In reply to Nathan Froyd [:froydnj] from comment #12)

Ah, so, the profiling data that we're getting out of the IR-level instrumentation is ~600k in size, vs. 200MB for the front-end based instrumentation. And that's because we're dying with some sort of access violation when we're doing the first run to set up our profile. So what we're actually doing is trying to do profile-guided optimization without any sort of useful profiling data, which is bad.

Not sure yet where we're dying; going to try and look at that today.

Sounds like bug 1548515, or is that something different?

(In reply to Michael Shal [:mshal] from comment #13)

(In reply to Nathan Froyd [:froydnj] from comment #12)

Ah, so, the profiling data that we're getting out of the IR-level instrumentation is ~600k in size, vs. 200MB for the front-end based instrumentation. And that's because we're dying with some sort of access violation when we're doing the first run to set up our profile. So what we're actually doing is trying to do profile-guided optimization without any sort of useful profiling data, which is bad.

Not sure yet where we're dying; going to try and look at that today.

Sounds like bug 1548515, or is that something different?

dmajor agreed to look into this problem, and investigation seems to be pointing at a situation similar to https://bugzilla.mozilla.org/show_bug.cgi?id=1514965#c18 . We probably have to turn off pgo instrumentation for more files at the very minimum.

So... the profile-run tier doesn't fail when Firefox crashes? Didn't we make it fail when it was all in one job? (or was that still pending because we had too much failure?)

(In reply to Mike Hommey [:glandium] from comment #15)

So... the profile-run tier doesn't fail when Firefox crashes? Didn't we make it fail when it was all in one job? (or was that still pending because we had too much failure?)

Yeah, bug 1252556 made the task fail if Firefox fails during PGO. That should still apply for the profile-run tier in 3-tier PGO - is there a treeherder link I can look at where that's not the case?

(In reply to Michael Shal [:mshal] from comment #16)

Yeah, bug 1252556 made the task fail if Firefox fails during PGO. That
should still apply for the profile-run tier in 3-tier PGO - is there a
treeherder link I can look at where that's not the case?

Here's one: https://treeherder.mozilla.org/#/jobs?repo=try&revision=98bea3ecf05e88baad3abaf85c5bf96c69c5606e&selectedJob=245462576

Depends on: 1550868

(In reply to David Major [:dmajor] from comment #17)

(In reply to Michael Shal [:mshal] from comment #16)

Yeah, bug 1252556 made the task fail if Firefox fails during PGO. That
should still apply for the profile-run tier in 3-tier PGO - is there a
treeherder link I can look at where that's not the case?

Here's one: https://treeherder.mozilla.org/#/jobs?repo=try&revision=98bea3ecf05e88baad3abaf85c5bf96c69c5606e&selectedJob=245462576

So, this is interesting - apparently the fact that we call profileserver.py with an environment variable (setting JARLOG_FILE) causes the return code checking to fail in our version of make on Windows.

STR:

$ cat Makefile
profiledbuild:
@echo Hi
JARLOG_FILE=foo python fail.py
@echo SHOULD NOT SEE THIS

$ cat fail.py
import sys
rc = 3221225477
print("Returning rc=%i" % rc)
sys.exit(rc)

$ mozmake
Hi
JARLOG_FILE=foo python fail.py
Returning rc=3221225477
SHOULD NOT SEE THIS

If instead you remove the JARLOG_FILE=foo from the Makefile, so that it just calls 'python fail.py' as the second line, then you get the expected behavior:

$ mozmake
Hi
python fail.py
Returning rc=3221225477
mozmake: *** [Makefile:3: profiledbuild] Error -1

On my Linux machine with make-4.1, both cases get the expected behavior.

Maybe we should just change it to sys.exit(1) instead of sys.exit(ret)? Using 'rc = 1' in fail.py produces the expected behavior both with an without the environment variable, so it seems the specific value of the return code matters.

glandium, any insights here?

Flags: needinfo?(mh+mozilla)

This smells like a bug in make on Windows. I was able to reproduce, and to find a workaround: use env. as in env JARLOG_FILE=foo python ... instead of JARLOG_FILE=foo python ....

Flags: needinfo?(mh+mozilla)

FWIW, it seems the bug is fixed in GNU make git master.

Depends on: 1551698

We're planning on switching to IR-based profiling, so we can't use the
frontend-based instrumentation to collect the order in which functions
are executed...at least not during the build itself. Performance tests
indicate that not having the order information decreases performance
significantly. So we're going to check in static files for Win32 and
Win64 and use those to perform the ordering. It's OK if these files are
slightly out of date; as of this writing, builds that generate and then
use these files complain that ~1/3 of the functions can't be found (!).
We're just trying to do something slightly smarter than whatever the
linker default is.

This form of instrumentation is more like our other platforms, and also
opens the possibility of interacting properly with Rust IR-level PGO.

Depends on D31132

We're moving to IR-level PGO instrumentation for clang-cl. We've also
moved to using static linker ordering files, which was the primary
application of the previous style of PGO instrumentation. We therefore
we no longer need this code.

Depends on D31133

(In reply to Mike Hommey [:glandium] from comment #20)

FWIW, it seems the bug is fixed in GNU make git master.

Is it worth backporting the fix into mozmake? It seems like the kind of thing that could bite us elsewhere if we just use the 'env' fix for PGO. (During a quick search, I see another instance of setting an environment variable this way in the packager: https://dxr.mozilla.org/mozilla-central/rev/5f95b3f2ea44723ba6a8c41a4b27c88032df709f/toolkit/mozapps/installer/packager.mk#23, but maybe that isn't able to fail with a bad error code)

Note: I pushed to try with this with some other stuff, and it looks like taskcluster/scripts/misc/run-profileserver.sh needs updating - it assumes we have a single default.profraw file at https://searchfox.org/mozilla-central/source/taskcluster/scripts/misc/run-profileserver.sh#36 .

Blocks: 1196094
Depends on: 1553823
Depends on: 1553972
Depends on: 1553973
Pushed by nfroyd@mozilla.com:
https://hg.mozilla.org/integration/autoland/rev/13461d98a55f
use a static MOZ_PROFILE_ORDER_FILE; r=nalexander,chmanchester
https://hg.mozilla.org/integration/autoland/rev/bd7bac95cd92
switch clang{,-cl} to use IR-based instrumentation for PGO; r=firefox-build-system-reviewers,chmanchester
https://hg.mozilla.org/integration/autoland/rev/085342ba416f
remove code for frontend-based PGO instrumentation; r=dmajor

Some windows build times have improved:

== Change summary for alert #21085 (as of Sat, 25 May 2019 00:48:46 GMT) ==

Improvements:

5% build times windows2012-32-shippable opt nightly taskcluster-c4.4xlarge 5,956.90 -> 5,648.71
4% build times windows2012-64-shippable opt nightly taskcluster-c4.4xlarge 6,398.66 -> 6,114.51

For up to date results, see: https://treeherder.mozilla.org/perf.html#/alerts?id=21085

== Change summary for alert #21091 (as of Sat, 25 May 2019 16:44:23 GMT) ==

Improvements:

5% raptor-webaudio-firefox linux64-shippable-qr opt 159.50 -> 150.92
4% raptor-webaudio-firefox linux64-shippable opt 152.67 -> 145.83
3% raptor-tp6-instagram-firefox loadtime linux64-shippable opt 333.67 -> 322.29
3% raptor-speedometer-firefox linux64-shippable-qr opt 88.85 -> 91.26
2% raptor-tp6-sheets-firefox loadtime linux64-shippable-qr opt 709.96 -> 695.38

For up to date results, see: https://treeherder.mozilla.org/perf.html#/alerts?id=21091

Regressions: 1555298
Regressions: 1562612
Regressions: 1590427
Depends on: 1606191
You need to log in before you can comment on or make changes to this bug.