Bugzilla

Comment 4

•

3 years ago

I'm not going to have time to look into this in the near term.

Tyson, are you aware of any problem like this showing up with ASan Fission fuzzing? I feel like it would show up there.

Component: DOM: Navigation → DOM: Content Processes

Flags: needinfo?(continuation) → needinfo?(twsmith)

Summary: Fission Web Processes not reaped leaving zombies → Fission Web Processes not reaped leaving zombies with ASan

Tyson Smith [:tsmith]

Comment 5

•

3 years ago

I do see processes hanging around from time to time when I check manually. I have no idea what they are from or how to detect them when fuzzing.

Flags: needinfo?(twsmith)

Reporter

Comment 6

•

3 years ago

I investigated a bit further.
When leaving a site (either due to navigating to a new site or by closing the tab) ProcessWatcher::EnsureProcessTerminated(pid, force=false) is reached.
A ChildLaxReaper is created and we reach ChildLaxReaper::OnSignal which in turn calls into ChildReaper::OnSignal.
The call to IsProcessDead calls into base::DidProcessCrash().
Since the build is configured with MOZ_ENABLE_FORKSERVER, the waitpid call in DidProcessCrash cannot be reached.

However: while MOZ_ENABLE_FORKSERVER is true, the fork-server is actually not used. How come?
The config value "dom.ipc.forkserver.enable" defaults to false (https://searchfox.org/mozilla-central/rev/bd25b6690f53e16e4b2b1aef04acd65ba1ba9b87/modules/libpref/init/StaticPrefList.yaml#2084).
Even in the default would be true, the false case should probably be handled.
Manually setting the pref to true fixes the issue for the Fission and the non-Fission case.
Turns out a self-compiled Firefox leaks zombies even if Fission is disabled (so its not a Fission problem after all).
Instead, the process shutdown logic misses the case of having a build with MOZ_ENABLE_FORKSERVER in combination with a dom.ipc.forkserver.enable=false flag.

Comment 7

•

3 years ago

Do you have --enable-forkserver in your MOZCONFIG? It isn't really a supported option. We only have it in the tree because some people are using it on their custom phone OS version of Firefox.

Flags: needinfo?(lukas.bernhard)

Reporter

Comment 8

•

3 years ago

No I don't. My .mozconfig is:

CXX=clang++-11

# Enable ASan specific code and build workarounds
ac_add_options --enable-address-sanitizer

# Mandatory options required for ASan builds (both on Linux and Mac)
export MOZ_DEBUG_SYMBOLS=1
ac_add_options --enable-debug-symbols
ac_add_options --disable-install-strip
ac_add_options --disable-jemalloc
ac_add_options --disable-crashreporter
ac_add_options --disable-profiling

ac_add_options --disable-debug
ac_add_options --enable-optimize="-O2 -gline-tables-only"
ac_add_options --enable-valgrind

ac_add_options --disable-elf-hack

Flags: needinfo?(lukas.bernhard)

Reporter

Comment 9

•

3 years ago

The code gets compiled with MOZ_ENABLE_FORKSERVER=1 though.

Neha Kochar [:neha]

Comment 10

•

3 years ago

Not a Fission issue, removing Fission milestone tracker.
Thank you Lukas for further investigation to confirm the cause.

Fission Milestone: M8 → ---

Neha Kochar [:neha]

Updated

•

3 years ago

Summary: Fission Web Processes not reaped leaving zombies with ASan → Web Processes not reaped leaving zombies with ASan

Comment 11

•

3 years ago

Gabriele, is it expected that we're building on Linux with MOZ_ENABLE_FORKSERVER? That's a little surprising to me. I'm not sure if that's the issue, or if there's something weird going on with a fork server build with the fork server disabled.

Jed Davis [:jld] ⟨⏰|UTC-7⟩ ⟦he/him⟧

Reporter

Comment 12

•

3 years ago

Function forkserver_default in toolkit/moz.configure returns True for browser builds on Linux and FreeBSD. Ticket #1470591 sets the forkserver as Fission M6 so it might be the new default?

Assignee

Comment 13

•

3 years ago

Building the fork server by default but having it preffed off at runtime does seem to be the intent of D61255 (bug 1609881). However, the kill(pid, 0) code in DidProcessCrash really shouldn't be used even with the fork server — pids can be reused, so it's unlikely but possible for the check to be inaccurate — and definitely shouldn't be used instead of waitpid when the fork server isn't active. I was aware of that code, but the current state of child process handling is complicated enough that I didn't realize the full impact.

The other thing going on here is ChildLaxReaper vs. ChildGrimReaper: if we're doing any kind of leak checking (MOZ_FREE_PERMANENT_DATA; this includes ASan builds because of LSan, as well as debug builds for refcount logging), we wait forever for the process to exit rather than killing it after a timeout, and that's different code.

So here's what's going on, as far as I can tell from some local testing:

On debug builds and ASan builds, every child process becomes a zombie until browser shutdown. Somehow this doesn't cause any test failures (or at least none that have been correctly attributed), which is its own problem.
On non-ASan opt builds, every child process is a zombie for 2 seconds, then we uselessly send it SIGKILL, then we collect it.

I'm in the process of rewriting this code in bug 1658072, but once again the fork server is a problem.

Maybe we should just turn off MOZ_ENABLE_FORKSERVER until there's a proper fix for this?

Assignee: nobody → jld

Severity: -- → S2

Status: UNCONFIRMED → ASSIGNED

Component: DOM: Content Processes → IPC

Ever confirmed: true

Gabriele Svelto [:gsvelto]

Comment 14

•

3 years ago

(In reply to Jed Davis [:jld] ⟨⏰|UTC-7⟩ ⟦he/him⟧ from comment #13)

Building the fork server by default but having it preffed off at runtime does seem to be the intent of D61255 (bug 1609881).

Yes, we wanted to eventually also enable it by default on all platforms that supported it. The memory savings it provides are quite good and in the context of Fission it'd be a boon.

I'm in the process of rewriting this code in bug 1658072, but once again the fork server is a problem.

Maybe we should just turn off MOZ_ENABLE_FORKSERVER until there's a proper fix for this?

I guess so if there aren't any alternatives. NI?ing Thinker who wrote the forkserver code to see if he can help.

Flags: needinfo?(thinker.li)

Kris Maglione [:kmag]

Comment 15

•

3 years ago

I'd rather we not build the fork server on desktop platforms when it's not going to be enabled, particularly given that I don't think we're ever going to use the current implementation in Firefox.

Comment 16

•

3 years ago

I will work a patch for this before next week.

Flags: needinfo?(thinker.li)

Comment 17

•

3 years ago

More info. Actually, the forkserver process does handle SIGCHLD and call waitpid(). But, sometime it may miss it, I am still not sure how it happen. But, in theory, if more than one SIGCHLD events happen in a very short and the signal handler does handle them in time, it may miss events.

Comment 18

•

3 years ago

Attached file Bug 1697429 - Handle crash with the old way if the fork server is prefed out. — Details

Call waitpid() repeatly until no more process changed state.

Comment 19

•

3 years ago

Can anyone check if this patch fixes the issue here?

Reporter

Comment 20

•

3 years ago

Assume one builds Firefox with fork-server enabled (default behavior, unless disabled with ac_add_options --disable-forkserver) but disabled runtime pref (also default). In this case handle_sigchld is never established as signal handler, as this would require a call to InitForkServerProcess (which doesn't happen due to the runtime config).
Hence changing handle_sigchld such that waitpid is called repeatedly will not resolve the issue. I verified this reasoning by running Firefox in the described configuration with the patch.

Comment 21

•

3 years ago

Lukas, Firefox is built with forkserver, but not enable it during runtime, right?

Have you ever disabled forkserver totally? I mean with --disable-forkserverk.

Flags: needinfo?(lukas.bernhard)

Comment 22

•

3 years ago

I have updated the patch to fallback to waitpid() if the pref is off.
Lukas, could you check it? Thanks!

Reporter

Comment 23

•

3 years ago

Disabling the forkserver with --disable-forkserverk prevented the zombies without any patches. I tested the revised patch and can confirm that no zombie processes linger around anymore.

Flags: needinfo?(lukas.bernhard)

Phabricator Automation

Updated

•

3 years ago

Attachment #9212991 - Attachment description: Bug 1697429 - Read state changes of all processes in SIGCHLD handler. → Bug 1697429 - Handle crash with the old way if the fork server is prefed out.