Closed Bug 1964600 Opened 10 months ago Closed 8 months ago

Daemonize the crash helper process on Linux and macOS

Categories

(Toolkit :: Crash Reporting, enhancement)

enhancement

Tracking

()

RESOLVED FIXED
142 Branch
Tracking Status
firefox-esr140 --- fixed
firefox142 --- fixed

People

(Reporter: gsvelto, Assigned: gsvelto)

References

(Blocks 1 open bug)

Details

(Keywords: perf-alert)

Attachments

(12 files, 1 obsolete file)

48 bytes, text/x-phabricator-request
Details | Review
48 bytes, text/x-phabricator-request
Details | Review
48 bytes, text/x-phabricator-request
Details | Review
48 bytes, text/x-phabricator-request
Details | Review
48 bytes, text/x-phabricator-request
Details | Review
48 bytes, text/x-phabricator-request
Details | Review
48 bytes, text/x-phabricator-request
Details | Review
48 bytes, text/x-phabricator-request
Details | Review
48 bytes, text/x-phabricator-request
Details | Review
48 bytes, text/x-phabricator-request
Details | Review
48 bytes, text/x-phabricator-request
Details | Review
48 bytes, text/x-phabricator-request
Details | Review

The crash helper process was designed to go away as soon as the main process dies, however there's been reports of the process lingering around after the main process died. The simplest way to avoid this issue is to daemonize the crash helper so that it will be re-parented to PID 1 and the main process won't have to deal with reaping it.

This is the second step towards daemonizing the crash helper process. The code
used to listen to incoming connections and messages is moved directly into the
server for better coupling. New and existing connections are now categorized
as belonging to Firefox parent process, a child process or an external process
(typically the WER service on Windows). The permission checks on the various
messages are now based on these categories, instead of the PIDs of the
crash helper clients.

This paves the way for the daemonization of the crash helper. First of all the
main process won't know the PID of the crash helper process directly, as it
won't be a child of the main process anymore. Additionally, child processes'
IPC channels will be created in the main process, and then both ends will be
handed to the crash helper and child respectively. Because of this it won't be
possible for a child to fetch the PID of the other endpoint of the pipe (all
operating systems record the process identifier of the process that created
the pipe as the endpoint, regardless of what process one endpoint is handed
to afterwards).

This patch removes all references to the crash helper PID and ways to fetch it
from the crash helper IPC machinery. See the following patches for how we
deal with authorizing certain operations now that we don't have the PIDs
anymore.

This is the second step towards daemonizing the crash helper process. The code
used to listen to incoming connections and messages is moved directly into the
server for better coupling. New and existing connections are now categorized
as belonging to Firefox parent process, a child process or an external process
(typically the WER service on Windows). The permission checks on the various
messages are now based on these categories, instead of the PIDs of the
crash helper clients.

Blocks: 1966353
Attachment #9491278 - Attachment is obsolete: true
Depends on: 1969446

This channel is currently only used on Linux to rendez-vous with the
crash helper, obtain its PID and use it to enable the process to be
dumped when the Yama LSM is enabled. It will be used to actually request
a dump in the future, when the breakpad exception handler is removed.

This is almost ready, it just needs a bit of testing and a tiny bit of code in the Linux implementation. It has grown significantly though (the patch set is probably north of 1k lines worth of changes) but it paves the way for future changes.

Assignee: nobody → gsvelto
Status: NEW → ASSIGNED
Attachment #9491279 - Attachment description: WIP: Bug 1964600 - Remove the crash helper PID from the IPC connectors → Bug 1964600 - Remove the crash helper PID from the IPC connectors r=afranchuk
Attachment #9491280 - Attachment description: WIP: Bug 1964600 - Move the IPC waiting and connection logic directly into the server → Bug 1964600 - Move the IPC waiting and connection logic directly into the server r=afranchuk
Attachment #9492984 - Attachment description: WIP: Bug 1964600 - Allow macOS crash helper IPC to pass file descriptors → Bug 1964600 - Allow macOS crash helper IPC to pass file descriptors r=afranchuk
Attachment #9495268 - Attachment description: WIP: Bug 1964600 - Allow Windows crash helper IPC to pass handles → Bug 1964600 - Allow Windows crash helper IPC to pass handles r=afranchuk
Attachment #9495269 - Attachment description: WIP: Bug 1964600 - Daemonize the crash helper process on Linux and macOS → Bug 1964600 - Daemonize the crash helper process on Linux and macOS r=afranchuk
Attachment #9495270 - Attachment description: WIP: Bug 1964600 - Introduce an IPC channel between child processes and the crash helper → Bug 1964600 - Introduce an IPC channel between child processes and the crash helper r=afranchuk

I was ready to land this but further testing on TC shows that the changes might impact the number of intermittent failures on macOS. Specifically it seems like this makes a known problem with utility processes worse. Under some conditions we fail to generate a minidump when we kill a utility process. See this error for how it looks like in the wild. This is a known issue and a long standing one, but it's been under the pain threshold for the longest time so we haven't addressed it yet. It is also possible that it is a Breakpad issue, which made me loathe looking at it given that we're phasing out breakpad. I'll dig into it because I cannot land this with this kind of failure rate, it'd make the macOS test runs very orange.

More digging into the test results shows that the failures are happening largely on 10.15. I probably hit some nasty corner-case in 10.15 with the new code I've introduced.

This appears to be an issue with AF_UNIX sockets on macOS 10.15 specifically. This might be a lot of hassle given even the existing code path for more recent versions of macOS contains two workarounds around an unfixed kernel bug.

Regressions: 1975853
Regressions: 1977807
QA Whiteboard: [qa-triage-done-c143/b142]
No longer regressions: 1978519
Regressions: 1981002

(In reply to Pulsebot from comment #12)

Pushed by gsvelto@mozilla.com:
https://github.com/mozilla-firefox/firefox/commit/0b2f2877c6f2
https://hg.mozilla.org/integration/autoland/rev/e792d735168b
Remove the crash helper PID from the IPC connectors r=afranchuk
https://github.com/mozilla-firefox/firefox/commit/8d6e237319bf
https://hg.mozilla.org/integration/autoland/rev/6eefae4581af
Move the IPC waiting and connection logic directly into the server
r=afranchuk
https://github.com/mozilla-firefox/firefox/commit/d8a08af8244f
https://hg.mozilla.org/integration/autoland/rev/65d1352abcb1
Allow macOS crash helper IPC to pass file descriptors r=afranchuk
https://github.com/mozilla-firefox/firefox/commit/ef48704b7a7b
https://hg.mozilla.org/integration/autoland/rev/b8224aa4253e
Allow Windows crash helper IPC to pass handles r=afranchuk
https://github.com/mozilla-firefox/firefox/commit/767696d4884e
https://hg.mozilla.org/integration/autoland/rev/96c59acf4e43
Daemonize the crash helper process on Linux and macOS r=afranchuk
https://github.com/mozilla-firefox/firefox/commit/846cb99e6fab
https://hg.mozilla.org/integration/autoland/rev/c9486da0aa0b
Introduce an IPC channel between child processes and the crash helper
r=afranchuk

Perfherder has detected a browsertime performance change from push c9486da0aa0b91e203bd0ffc33d65f799f61cbee.

If you have any questions, please reach out to a performance sheriff. Alternatively, you can find help on Slack by joining #perf-help, and on Matrix you can find help by joining #perftest.

Improvements:

Ratio Test Platform Options Absolute values (old vs new) Performance Profiles
15% speedometer3 Perf-Dashboard/SelectingRange/Async windows11-64-24h2-shippable fission webrender 1.30 -> 1.11 Before/After
3% speedometer3 NewsSite-Next/NavigateToUS/Async windows11-64-24h2-shippable fission webrender 21.12 -> 20.53 Before/After

Details of the alert can be found in the alert summary, including links to graphs and comparisons for each of the affected tests.

If you need the profiling jobs you can trigger them yourself from treeherder job view or ask a performance sheriff to do that for you.

You can run all of these tests on try with ./mach try perf --alert 45932

The following documentation link provides more information about this command.

Keywords: perf-alert

This paves the way for the daemonization of the crash helper. First of all the
main process won't know the PID of the crash helper process directly, as it
won't be a child of the main process anymore. Additionally, child processes'
IPC channels will be created in the main process, and then both ends will be
handed to the crash helper and child respectively. Because of this it won't be
possible for a child to fetch the PID of the other endpoint of the pipe (all
operating systems record the process identifier of the process that created
the pipe as the endpoint, regardless of what process one endpoint is handed
to afterwards).

This patch removes all references to the crash helper PID and ways to fetch it
from the crash helper IPC machinery. See the following patches for how we
deal with authorizing certain operations now that we don't have the PIDs
anymore.

Original Revision: https://phabricator.services.mozilla.com/D251558

Attachment #9508694 - Flags: approval-mozilla-esr140?

This is the second step towards daemonizing the crash helper process. The code
used to listen to incoming connections and messages is moved directly into the
server for better coupling. New and existing connections are now categorized
as belonging to Firefox parent process, a child process or an external process
(typically the WER service on Windows). The permission checks on the various
messages are now based on these categories, instead of the PIDs of the
crash helper clients.

Original Revision: https://phabricator.services.mozilla.com/D251559

Attachment #9508695 - Flags: approval-mozilla-esr140?
Attachment #9508696 - Flags: approval-mozilla-esr140?
Attachment #9508697 - Flags: approval-mozilla-esr140?
Attachment #9508698 - Flags: approval-mozilla-esr140?

This channel is currently only used on Linux to rendez-vous with the
crash helper, obtain its PID and use it to enable the process to be
dumped when the Yama LSM is enabled. It will be used to actually request
a dump in the future, when the breakpad exception handler is removed.

Original Revision: https://phabricator.services.mozilla.com/D254047

Attachment #9508699 - Flags: approval-mozilla-esr140?
Attachment #9508694 - Flags: approval-mozilla-esr140? → approval-mozilla-esr140+
Attachment #9508695 - Flags: approval-mozilla-esr140? → approval-mozilla-esr140+
Attachment #9508696 - Flags: approval-mozilla-esr140? → approval-mozilla-esr140+
Attachment #9508697 - Flags: approval-mozilla-esr140? → approval-mozilla-esr140+
Attachment #9508698 - Flags: approval-mozilla-esr140? → approval-mozilla-esr140+
Attachment #9508699 - Flags: approval-mozilla-esr140? → approval-mozilla-esr140+
Blocks: 587729
You need to log in before you can comment on or make changes to this bug.

Attachment

General

Created:
Updated:
Size: