Closed Bug 1658072 Opened 5 years ago Closed 7 months ago

Rewrite the POSIX ProcessWatcher to not use libevent's signal handling

Categories

(Core :: IPC, defect)

defect

Tracking

()

RESOLVED FIXED
136 Branch
Tracking Status
firefox136 --- fixed

People

(Reporter: jld, Assigned: jld)

References

(Blocks 1 open bug)

Details

Attachments

(4 files, 1 obsolete file)

Bug 1616462 has highlighted some thread-safety issues in libevent's signal handling that would be difficult to fix. However, we're using it only to catch SIGCHLD in ProcessWatcher::EnsureTermination, which could be simplified with a rewrite anyway.

So I've written a replacement that doesn't use libevent, and lets us delete several hundred lines of glue code in the process. Currently it needs more testing and cleanup.

See Also: → 1658474

Triaging through all the tsan backlog -- whatever happened with this? Did it not work out, or just get backlogged?

Flags: needinfo?(jld)

I have patches for this and they seem to work, but they need cleanup (and there are issues with the fork server, which I need to figure out a solution for).

Flags: needinfo?(jld)
See Also: → 1750044

This patch rewrites the Unix backend of ProcessWatcher for two reasons:

  1. To remove the use of libevent's signal handling, which has concurrency
    bugs that can't be easily fixed upstream (see Bugzilla for details)

  2. To simplify the code in general; in particular, the new version has
    one place where the process and its exit status are consumed from the OS

The new implementation uses the same pipe-to-self technique as libevent
(and which we use elsewhere) to deal with async signal safety. Unlike
the previous version, there is a single object which manages all
monitored child processes rather than one each. (Previously, this
multiplexing was done inside libevent.)

The function DidProcessCrash is now dead code. Before the ProcessWatcher
rewrite, its return value (i.e., whether the process crashed) was never
used, so effectively its only purpose was to make it harder to understand
where the waitpid calls were happening.

Now that we're no longer using libevent's signal handling, we don't need
an OO wrapper for it.

Given that libevent's signal handling code is known to have race
conditions, and there are fundamental issues that make it hard to fix
upstream, and previous patches have removed our last usage of it, we
should assert that it's no longer used.

Blocks: 1769821
Pushed by jedavis@mozilla.com: https://hg.mozilla.org/integration/autoland/rev/aaff4b4fd82e Rewrite the Unix implementation of IPC process termination handling. r=nika
Pushed by jedavis@mozilla.com: https://hg.mozilla.org/integration/autoland/rev/e0e98ee85f98 Cleanup: remove the now-dead DidProcessCrash function. r=nika https://hg.mozilla.org/integration/autoland/rev/a942be3d053d Cleanup: remove the signal handling glue in the IPC event loop. r=nika https://hg.mozilla.org/integration/autoland/rev/199d3ecfe13c Cleanup: assert that libevent's signal handling code is never used. r=nika

Backed out for causing xpcshell failures on process_watcher_posix_sigchld.cc

Backout link // Backout link for the following changesets as they are related to the previous backout

Push with failures

Failure log

Flags: needinfo?(jld)

What's going on here is that this code to use NTLM authentication on Unix uses PR_CreateProcess, which installs its own SIGCHLD handler. In this test case, where we use the socket process for HTTP (not the default yet), first that happens and then later IPC uses ProcessWatcher when the socket process exits, during which we try to set our SIGCHLD handler and fail an assertion that there isn't an existing handler.

The problem is, if that happens in the opposite order, NSPR will remove our signal handler; it has a similar assertion but I'm pretty sure it's not used because of this preprocessor stuff, which is expecting the semi-broken threading support from Linux 2.4 (obsoleted by NPTL in 2004). Also, NSPR's child process handling conflicts with anything else; see bug 227246 (which also has some comments about the NPTL issue).

So I think that NTLM code needs to changed to use something other than NSPR (so, probably IPC), but also I might want to try adding assertions that IPC's SIGCHLD handler isn't replaced. (That's not necessarily a problem in general, if the new handler also runs the old one, but NSPR doesn't do that.)

Flags: needinfo?(jld)

This patch rearranges the abstraction around waitpid/waitid. There
is now a function WaitForProcess that returns more detailed information
about the process status, and then a wrapper IsProcessDead which is
private to the ProcessWatcher which simplifies the results and logs
information to stderr. In the future, the extended information could be
exposed to observers within Gecko in some way.

Attachment #9268176 - Attachment is obsolete: true
Pushed by jedavis@mozilla.com: https://hg.mozilla.org/integration/autoland/rev/a43887934808 Preamble: refactor IsProcessDead. r=ipc-reviewers,nika https://hg.mozilla.org/integration/autoland/rev/35b64d418961 Rewrite the Unix implementation of IPC process termination handling. r=nika https://hg.mozilla.org/integration/autoland/rev/1618e1a59f87 Cleanup: remove the signal handling glue in the IPC event loop. r=nika https://hg.mozilla.org/integration/autoland/rev/bab46fe43af6 Cleanup: assert that libevent's signal handling code is never used. r=nika https://hg.mozilla.org/integration/autoland/rev/906d7fc507a3 apply code formatting via Lando

Backed out for causing build bustages @process_watcher_posix_sigchld.cc.

Flags: needinfo?(jld)
Pushed by jedavis@mozilla.com: https://hg.mozilla.org/integration/autoland/rev/b67107ba5ac6 Preamble: refactor IsProcessDead. r=ipc-reviewers,nika https://hg.mozilla.org/integration/autoland/rev/6d4d0ca2be21 Rewrite the Unix implementation of IPC process termination handling. r=nika https://hg.mozilla.org/integration/autoland/rev/cd287d6fb098 Cleanup: remove the signal handling glue in the IPC event loop. r=nika https://hg.mozilla.org/integration/autoland/rev/e3c3c9805387 Cleanup: assert that libevent's signal handling code is never used. r=nika
Flags: needinfo?(jld)
You need to log in before you can comment on or make changes to this bug.

Attachment

General

Created:
Updated:
Size: