Open Bug 1558205 Opened 5 years ago Updated 2 years ago

Shippable windows builds fail due to corrupted llvm profile data when the socket process is enabled

Categories

(Firefox Build System :: General, defect)

Tracking

(Not tracked)

People

(Reporter: bwc, Unassigned)

Details

(Keywords: in-triage)

As seen in https://bugzilla.mozilla.org/show_bug.cgi?id=1555792#c16

The logging doesn't give any clues as to what has gone wrong, unfortunately. The profraw files aren't kept as artifacts, so it is hard to tell what is wrong with them.

glandium, can you help debug this? It's blocking landing a Fission feature we were hoping to demo at the all-hands next week.

Flags: needinfo?(mh+mozilla)

Here's a try with the profraw files uploaded as artifacts:
https://treeherder.mozilla.org/#/jobs?repo=try&revision=dd406e4ea653e5cd1cf15d356d53c2ffdbc1b195

The one llvm-profdata barfs about seems to be truncated. Is the socket process being killed non-gracefully too early?
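To narrow down which uploaded profraw file is the bad one, each file can be checked on its own rather than relying on the merge step failing as a whole. The following is only a minimal sketch: it assumes `llvm-profdata` is on `PATH` and uses a non-zero exit status from `llvm-profdata show` as the signal that a file is unreadable. The helper names (`profdata_accepts`, `find_bad_profraws`) are made up for illustration and are not part of the build system.

```python
import subprocess
from pathlib import Path
from typing import Callable, Iterable, List


def profdata_accepts(path: Path, tool: str = "llvm-profdata") -> bool:
    """Return True if `llvm-profdata show` exits successfully for `path`.

    Assumption: a truncated or corrupt raw profile makes the tool exit
    with a non-zero status.
    """
    result = subprocess.run(
        [tool, "show", str(path)],
        stdout=subprocess.DEVNULL,
        stderr=subprocess.DEVNULL,
    )
    return result.returncode == 0


def find_bad_profraws(
    paths: Iterable[Path],
    accepts: Callable[[Path], bool] = profdata_accepts,
) -> List[Path]:
    """Return the files that are empty or rejected by the checker."""
    bad = []
    for path in sorted(paths):
        # An empty file is reported without invoking the tool at all.
        if path.stat().st_size == 0 or not accepts(path):
            bad.append(path)
    return bad
```

For example, `find_bad_profraws(Path("artifacts").glob("*.profraw"))` would list the suspect files, where `artifacts` is a hypothetical directory holding the downloaded profraw files.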

Flags: needinfo?(mh+mozilla) → needinfo?(docfaraday)
Keywords: in-triage

(In reply to Mike Hommey [:glandium] from comment #2)

> Here's a try with the profraw files uploaded as artifacts:
> https://treeherder.mozilla.org/#/jobs?repo=try&revision=dd406e4ea653e5cd1cf15d356d53c2ffdbc1b195
>
> The one llvm-profdata barfs about seems to be truncated. Is the socket process being killed non-gracefully too early?

I don't see why it would be. I don't see any sign in the logging that a crash has happened, and the socket process isn't crashing when running the mochitest suite (the spi tests that run on all platforms but android). What are we running to gather this profile data anyway?

Flags: needinfo?(docfaraday) → needinfo?(mh+mozilla)

That doesn't need to be a crash.

> What are we running to gather this profile data anyway?

See build/pgo/profileserver.py

Flags: needinfo?(mh+mozilla)

OK, I see that file pointing the binary at an index.html, but what is in that file? What code is the binary running? I can see where it puts its logfile, and I see errors in logs like the following:

Without socket process: https://taskcluster-artifacts.net/ERYF9hEsSC-aw-oKUCNg2w/0/public/build/profile-run-2.log
With socket process: https://taskcluster-artifacts.net/d9YnuCmiQ6aysynXT8oavg/0/public/build/profile-run-2.log

These look pretty similar to me, and they both look like something in IPC is broken. The logfiles are pretty minimal, so it is hard to tell what else might be going on.

I think I'm going to have to hard-code the socket process prefs off in the command line args in build/pgo/profileserver.py.
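As a rough illustration of that kind of workaround, the sketch below forces a pref off on top of whatever prefs the profiling run already uses, and renders the result in `user.js` syntax. It is not the actual change to build/pgo/profileserver.py: the pref name `network.process.enabled` is an assumption (this bug does not name the pref), and the helper names are hypothetical.

```python
import json
from typing import Dict, Union

PrefValue = Union[bool, int, str]

# Assumption: the pref that controls the socket process. Not named in this bug.
SOCKET_PROCESS_OVERRIDES: Dict[str, PrefValue] = {
    "network.process.enabled": False,
}


def with_overrides(
    prefs: Dict[str, PrefValue], overrides: Dict[str, PrefValue]
) -> Dict[str, PrefValue]:
    """Return a copy of `prefs` in which `overrides` win on conflict."""
    merged = dict(prefs)
    merged.update(overrides)
    return merged


def render_user_js(prefs: Dict[str, PrefValue]) -> str:
    """Render prefs as user.js lines, sorted by pref name."""
    lines = []
    for name, value in sorted(prefs.items()):
        # json.dumps produces the JavaScript literals: false, 1, "text".
        lines.append("user_pref({}, {});".format(json.dumps(name), json.dumps(value)))
    return "\n".join(lines) + "\n"
```

Applying the overrides last means the profiling run keeps the socket process off even if the base prefs turn it on.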

It looks like the patch from bug 1557762 to define NS_FREE_PERMANENT_DATA makes this work (see [1] for the try push). I'd guess something in one of the #ifdefs using that define is necessary to shut down cleanly, and without it the process dies while writing out the profile data, which causes the merge to fail. NS_FREE_PERMANENT_DATA is supposed to be defined if MOZ_PROFILE_GENERATE is defined [2], but that is only defined properly on 3-tier PGO builds [3], which I guess is why glandium's push with bug 1557785 worked.

I'd recommend we land the workaround in bug 1557762 for now as it sounds like that'll unblock some things until everything is moved into the 3-tier model.

[1] https://treeherder.mozilla.org/#/jobs?repo=try&revision=92857b5d96ab1b1098d5cb1c93bdb60080aa8806
[2] https://searchfox.org/mozilla-central/rev/928742d3ea30e0eb4a8622d260041564d81a8468/xpcom/base/nscore.h#177
[3] https://searchfox.org/mozilla-central/rev/928742d3ea30e0eb4a8622d260041564d81a8468/build/moz.configure/toolchain.configure#1471
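The guess in the comment above is that the process exits without shutting down cleanly and so never finishes writing its profile data. The Python sketch below is only an analogy, not Firefox code. It assumes the data is written by an exit-time handler, and shows that such a handler runs on a normal exit but is skipped when the process terminates abruptly. It demonstrates the skipped-handler case, not the mid-write truncation seen in the try push.

```python
import subprocess
import sys
import tempfile
from pathlib import Path

# Child program: registers an exit handler that writes a stand-in "profile",
# then exits either normally or abruptly depending on its second argument.
CHILD = """
import atexit, os, sys

def write_profile():
    with open(sys.argv[1], "w") as out:
        out.write("profile data")

atexit.register(write_profile)
if sys.argv[2] == "abrupt":
    os._exit(0)  # terminate immediately; exit handlers do not run
"""


def profile_written(mode: str) -> bool:
    """Run the child in `mode` ("clean" or "abrupt"); report if it wrote its file."""
    with tempfile.TemporaryDirectory() as tmp:
        out = Path(tmp) / "child.profile"
        subprocess.run([sys.executable, "-c", CHILD, str(out), mode], check=True)
        return out.exists()
```

Here `profile_written("clean")` finds the file and `profile_written("abrupt")` does not, even though both children exit with status 0. This is consistent with the logs showing no crash.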

No longer blocks: 1555792
Severity: normal → S3