Shippable windows builds fail due to corrupted llvm profile data when the socket process is enabled
Categories
(Firefox Build System :: General, defect)
Tracking
(Not tracked)
People
(Reporter: bwc, Unassigned)
Details
(Keywords: in-triage)
As seen in https://bugzilla.mozilla.org/show_bug.cgi?id=1555792#c16
The logging doesn't give any clues as to what has gone wrong, unfortunately. The profraw files aren't kept as artifacts, so it is hard to tell what is wrong with them.
Comment 1•5 years ago
glandium, can you help debug this? It's blocking landing a Fission feature we were hoping to demo at the all-hands next week.
Comment 2•5 years ago
Here's a try with the profraw files uploaded as artifacts:
https://treeherder.mozilla.org/#/jobs?repo=try&revision=dd406e4ea653e5cd1cf15d356d53c2ffdbc1b195
The one llvm-profdata barfs about seems to be truncated. Is the socket process being killed non-gracefully too early?
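One way to narrow down which file is the problem is to feed each profraw to `llvm-profdata merge` on its own and note which ones it rejects. The following is only a sketch: the helper names and the directory layout are hypothetical, and it assumes `llvm-profdata` is on `PATH`.

```python
import subprocess
import tempfile
from pathlib import Path


def run_llvm_profdata(profraw, output):
    """Return True if llvm-profdata can merge this single file."""
    result = subprocess.run(
        ["llvm-profdata", "merge", "-o", str(output), str(profraw)],
        stdout=subprocess.DEVNULL,
        stderr=subprocess.DEVNULL,
    )
    return result.returncode == 0


def find_bad_profraws(directory, merge_one=run_llvm_profdata):
    """List the .profraw files in `directory` that fail to merge on their own.

    `merge_one` is injectable so the loop can be exercised without the
    real tool installed.
    """
    bad = []
    with tempfile.TemporaryDirectory() as scratch:
        output = Path(scratch) / "single.profdata"
        for profraw in sorted(Path(directory).glob("*.profraw")):
            if not merge_one(profraw, output):
                bad.append(profraw.name)
    return bad
```

Calling `find_bad_profraws("path/to/artifacts")` on the downloaded artifacts would then return the names of the files that do not merge individually.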
Reporter
Comment 3•5 years ago
(In reply to Mike Hommey [:glandium] from comment #2)
> Here's a try with the profraw files uploaded as artifacts:
> https://treeherder.mozilla.org/#/jobs?repo=try&revision=dd406e4ea653e5cd1cf15d356d53c2ffdbc1b195
> The one llvm-profdata barfs about seems to be truncated. Is the socket process being killed non-gracefully too early?
I don't see why it would be. I don't see any sign in the logging that a crash has happened, and the socket process isn't crashing when running the mochitest suite (the spi tests that run on all platforms but android). What are we running to gather this profile data anyway?
Comment 4•5 years ago
That doesn't need to be a crash.
> What are we running to gather this profile data anyway?
See build/pgo/profileserver.py
Reporter
Comment 5•5 years ago
Ok, I see that file pointing the binary at an index.html, but what is in that file? What code is the binary running? I can see where it puts its logfile, and I see errors like the following:
Without socket process: https://taskcluster-artifacts.net/ERYF9hEsSC-aw-oKUCNg2w/0/public/build/profile-run-2.log
With socket process: https://taskcluster-artifacts.net/d9YnuCmiQ6aysynXT8oavg/0/public/build/profile-run-2.log
These look pretty similar to me, and they both look like something in IPC is broken. The logfiles are pretty minimal, so it is hard to tell what else might be going on.
I think I'm going to have to hard-code the socket process prefs off in the command line args in build/pgo/profileserver.py.
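Roughly what I have in mind, sketched as a plain dict override rather than the real profileserver.py code. The pref name below is an assumption (it is not spelled out anywhere in this bug), and both helper functions are hypothetical:

```python
import json

# Assumption: this is the pref that gates the socket process.
# It is not named in this bug, so verify it before relying on it.
SOCKET_PROCESS_PREFS_OFF = {
    "network.process.enabled": False,
}


def with_socket_process_disabled(prefs):
    """Return a copy of `prefs` with the socket-process prefs forced off."""
    merged = dict(prefs)
    merged.update(SOCKET_PROCESS_PREFS_OFF)
    return merged


def to_user_js(prefs):
    """Render prefs as user.js lines, sorted by pref name."""
    lines = []
    for name, value in sorted(prefs.items()):
        # json.dumps gives JS-compatible literals: "str", true/false, numbers.
        lines.append("user_pref({}, {});".format(json.dumps(name), json.dumps(value)))
    return "\n".join(lines)
```

Applying the override last means it wins over whatever the profile's existing pref files say, which is the point of hard-coding it for the profiling run.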
Comment 6•5 years ago
So... I did a try push with bug 1557785 applied and... it didn't fail.
https://treeherder.mozilla.org/#/jobs?repo=try&revision=3e534ef5a867486430a77b78eabba97375dbcf07
And the same base changeset fails with bug 1557785 not applied:
https://treeherder.mozilla.org/#/jobs?repo=try&revision=a96ac3bcf9e542a502a536705e1928159be19b00
So... bets are still up.
Comment 7•5 years ago
It looks like the patch from bug 1557762 to define NS_FREE_PERMANENT_DATA makes this work (see [1] for the try push). I'd guess something in one of the #ifdefs using that define is necessary to shut down cleanly, and without it the process dies while writing out the profile data, which causes the merge to fail. NS_FREE_PERMANENT_DATA is supposed to be defined if MOZ_PROFILE_GENERATE is defined [2], but MOZ_PROFILE_GENERATE is only defined properly on 3-tier PGO builds [3], which I guess is why glandium's push with bug 1557785 worked.
I'd recommend we land the workaround in bug 1557762 for now as it sounds like that'll unblock some things until everything is moved into the 3-tier model.
[1] https://treeherder.mozilla.org/#/jobs?repo=try&revision=92857b5d96ab1b1098d5cb1c93bdb60080aa8806
[2] https://searchfox.org/mozilla-central/rev/928742d3ea30e0eb4a8622d260041564d81a8468/xpcom/base/nscore.h#177
[3] https://searchfox.org/mozilla-central/rev/928742d3ea30e0eb4a8622d260041564d81a8468/build/moz.configure/toolchain.configure#1471
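To illustrate the guess above about clean shutdown, here is a rough Python analogy, not Firefox code. It assumes the profile is written from an exit-time hook, and models that with `atexit`; the child script and `run_child` helper are made up for the example. A child that exits normally writes its output, while one that exits abruptly via `os._exit` skips the hook and writes nothing. A real process killed mid-write would leave a truncated file rather than no file at all, so this only shows the mechanism.

```python
import os
import subprocess
import sys
import tempfile

CHILD = r"""
import atexit, os, sys

path, mode = sys.argv[1], sys.argv[2]

def write_profile():
    # Stands in for a profiling runtime dumping its counters at exit.
    with open(path, "w") as f:
        f.write("profile data")

atexit.register(write_profile)

if mode == "abrupt":
    os._exit(0)  # exits immediately; atexit handlers never run
# mode "clean": fall off the end so the interpreter runs atexit handlers
"""


def run_child(mode):
    """Run the child in `mode` and report whether it wrote its output file."""
    with tempfile.TemporaryDirectory() as scratch:
        path = os.path.join(scratch, "child.out")
        subprocess.run([sys.executable, "-c", CHILD, path, mode], check=True)
        return os.path.exists(path)
```

Here `run_child("clean")` reports that the file was written and `run_child("abrupt")` reports that it was not, even though both children exit with status 0 and neither logs a crash.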
Updated•2 years ago