Closed Bug 1697429 Opened 3 years ago Closed 3 years ago

Web Processes not reaped leaving zombies with ASan

Categories

(Core :: IPC, defect)

Firefox 88
defect

Tracking

()

RESOLVED FIXED
95 Branch
Tracking Status
firefox95 --- fixed

People

(Reporter: lukas.bernhard, Assigned: jld)

References

Details

Attachments

(1 file)

User Agent: Mozilla/5.0 (X11; Linux x86_64; rv:88.0) Gecko/20100101 Firefox/88.0

Steps to reproduce:

Firefox Nighly ASAN with Fission enabled:
Opening an arbitrary web page assigns one of the preallocated processes as web process for the page.
Once the page is closed the Web process terminates.

Actual results:

The terminated web process is not reaped, leaving a zombie processes behind. Opening and closing the same page again leaves another zombie, so the number of zombies increases steadily.
This behavior reproduces all the time, leaving me with more than a dozen zombie processes.

strace shows the signal generated by the exiting web process:
[pid 994368] --- SIGCHLD {si_signo=SIGCHLD, si_code=CLD_EXITED, si_pid=996657, si_uid=1000, si_status=0, si_utime=802, si_stime=69} ---

=> PID 994368 receives SIGCHLD but apparently does not call waitpid. This process corresponds to: -{IPC Launch}(994364)

Expected results:

No zombie processes should accumulate.

The Bugbug bot thinks this bug should belong to the 'Core::DOM: Navigation' component, and is moving the bug to that component. Please revert this change in case you think the bot is wrong.

Component: Untriaged → DOM: Navigation
Product: Firefox → Core
Fission Milestone: --- → ?

Perhaps this is a race in the child reaping that happens due to ASan slowdown?

We should investigate before Fission M8. Nika says the child process reaper behavior is slightly different in ASan mode.

Severity: -- → S3
Fission Milestone: ? → M8
Priority: -- → P3
Severity: S3 → --
Priority: P3 → --

Andrew, can you investigate further?

Flags: needinfo?(continuation)

I'm not going to have time to look into this in the near term.

Tyson, are you aware of any problem like this showing up with ASan Fission fuzzing? I feel like it would show up there.

Component: DOM: Navigation → DOM: Content Processes
Flags: needinfo?(continuation) → needinfo?(twsmith)
Summary: Fission Web Processes not reaped leaving zombies → Fission Web Processes not reaped leaving zombies with ASan

I do see processes hanging around from time to time when I check manually. I have no idea what they are from or how to detect them when fuzzing.

Flags: needinfo?(twsmith)

I investigated a bit further.
When leaving a site (either due to navigating to a new site or by closing the tab) ProcessWatcher::EnsureProcessTerminated(pid, force=false) is reached.
A ChildLaxReaper is created and we reach ChildLaxReaper::OnSignal which in turn calls into ChildReaper::OnSignal.
The call to IsProcessDead calls into base::DidProcessCrash().
Since the build is configured with MOZ_ENABLE_FORKSERVER, the waitpid call in DidProcessCrash cannot be reached.

However: while MOZ_ENABLE_FORKSERVER is true, the fork-server is actually not used. How come?
The config value "dom.ipc.forkserver.enable" defaults to false (https://searchfox.org/mozilla-central/rev/bd25b6690f53e16e4b2b1aef04acd65ba1ba9b87/modules/libpref/init/StaticPrefList.yaml#2084).
Even in the default would be true, the false case should probably be handled.
Manually setting the pref to true fixes the issue for the Fission and the non-Fission case.
Turns out a self-compiled Firefox leaks zombies even if Fission is disabled (so its not a Fission problem after all).
Instead, the process shutdown logic misses the case of having a build with MOZ_ENABLE_FORKSERVER in combination with a dom.ipc.forkserver.enable=false flag.

Do you have --enable-forkserver in your MOZCONFIG? It isn't really a supported option. We only have it in the tree because some people are using it on their custom phone OS version of Firefox.

Flags: needinfo?(lukas.bernhard)

No I don't. My .mozconfig is:

CXX=clang++-11

# Enable ASan specific code and build workarounds
ac_add_options --enable-address-sanitizer

# Mandatory options required for ASan builds (both on Linux and Mac)
export MOZ_DEBUG_SYMBOLS=1
ac_add_options --enable-debug-symbols
ac_add_options --disable-install-strip
ac_add_options --disable-jemalloc
ac_add_options --disable-crashreporter
ac_add_options --disable-profiling

ac_add_options --disable-debug
ac_add_options --enable-optimize="-O2 -gline-tables-only"
ac_add_options --enable-valgrind

ac_add_options --disable-elf-hack
Flags: needinfo?(lukas.bernhard)

The code gets compiled with MOZ_ENABLE_FORKSERVER=1 though.

Not a Fission issue, removing Fission milestone tracker.
Thank you Lukas for further investigation to confirm the cause.

Fission Milestone: M8 → ---
Summary: Fission Web Processes not reaped leaving zombies with ASan → Web Processes not reaped leaving zombies with ASan

Gabriele, is it expected that we're building on Linux with MOZ_ENABLE_FORKSERVER? That's a little surprising to me. I'm not sure if that's the issue, or if there's something weird going on with a fork server build with the fork server disabled.

Function forkserver_default in toolkit/moz.configure returns True for browser builds on Linux and FreeBSD. Ticket #1470591 sets the forkserver as Fission M6 so it might be the new default?

Building the fork server by default but having it preffed off at runtime does seem to be the intent of D61255 (bug 1609881). However, the kill(pid, 0) code in DidProcessCrash really shouldn't be used even with the fork server — pids can be reused, so it's unlikely but possible for the check to be inaccurate — and definitely shouldn't be used instead of waitpid when the fork server isn't active. I was aware of that code, but the current state of child process handling is complicated enough that I didn't realize the full impact.

The other thing going on here is ChildLaxReaper vs. ChildGrimReaper: if we're doing any kind of leak checking (MOZ_FREE_PERMANENT_DATA; this includes ASan builds because of LSan, as well as debug builds for refcount logging), we wait forever for the process to exit rather than killing it after a timeout, and that's different code.

So here's what's going on, as far as I can tell from some local testing:

  • On debug builds and ASan builds, every child process becomes a zombie until browser shutdown. Somehow this doesn't cause any test failures (or at least none that have been correctly attributed), which is its own problem.
  • On non-ASan opt builds, every child process is a zombie for 2 seconds, then we uselessly send it SIGKILL, then we collect it.

I'm in the process of rewriting this code in bug 1658072, but once again the fork server is a problem.

Maybe we should just turn off MOZ_ENABLE_FORKSERVER until there's a proper fix for this?

Assignee: nobody → jld
Severity: -- → S2
Status: UNCONFIRMED → ASSIGNED
Component: DOM: Content Processes → IPC
Ever confirmed: true

(In reply to Jed Davis [:jld] ⟨⏰|UTC-7⟩ ⟦he/him⟧ from comment #13)

Building the fork server by default but having it preffed off at runtime does seem to be the intent of D61255 (bug 1609881).

Yes, we wanted to eventually also enable it by default on all platforms that supported it. The memory savings it provides are quite good and in the context of Fission it'd be a boon.

I'm in the process of rewriting this code in bug 1658072, but once again the fork server is a problem.

Maybe we should just turn off MOZ_ENABLE_FORKSERVER until there's a proper fix for this?

I guess so if there aren't any alternatives. NI?ing Thinker who wrote the forkserver code to see if he can help.

Flags: needinfo?(thinker.li)

I'd rather we not build the fork server on desktop platforms when it's not going to be enabled, particularly given that I don't think we're ever going to use the current implementation in Firefox.

I will work a patch for this before next week.

Flags: needinfo?(thinker.li)

More info. Actually, the forkserver process does handle SIGCHLD and call waitpid(). But, sometime it may miss it, I am still not sure how it happen. But, in theory, if more than one SIGCHLD events happen in a very short and the signal handler does handle them in time, it may miss events.

Call waitpid() repeatly until no more process changed state.

Can anyone check if this patch fixes the issue here?

Assume one builds Firefox with fork-server enabled (default behavior, unless disabled with ac_add_options --disable-forkserver) but disabled runtime pref (also default). In this case handle_sigchld is never established as signal handler, as this would require a call to InitForkServerProcess (which doesn't happen due to the runtime config).
Hence changing handle_sigchld such that waitpid is called repeatedly will not resolve the issue. I verified this reasoning by running Firefox in the described configuration with the patch.

Lukas, Firefox is built with forkserver, but not enable it during runtime, right?

Have you ever disabled forkserver totally? I mean with --disable-forkserverk.

Flags: needinfo?(lukas.bernhard)

I have updated the patch to fallback to waitpid() if the pref is off.
Lukas, could you check it? Thanks!

Disabling the forkserver with --disable-forkserverk prevented the zombies without any patches. I tested the revised patch and can confirm that no zombie processes linger around anymore.

Flags: needinfo?(lukas.bernhard)
Attachment #9212991 - Attachment description: Bug 1697429 - Read state changes of all processes in SIGCHLD handler. → Bug 1697429 - Handle crash with the old way if the fork server is prefed out.
See Also: → 1687376
See Also: → 1734737
Pushed by jedavis@mozilla.com:
https://hg.mozilla.org/integration/autoland/rev/88ae7d2397eb
Handle crash with the old way if the fork server is prefed out. r=jld

There was a trivial problem with a #include path; I fixed it, verified on Try, and I'm relanding.

Flags: needinfo?(jld)
Pushed by jedavis@mozilla.com:
https://hg.mozilla.org/integration/autoland/rev/0d6116a0dfe3
Handle crash with the old way if the fork server is prefed out. r=jld
Status: ASSIGNED → RESOLVED
Closed: 3 years ago
Resolution: --- → FIXED
Target Milestone: --- → 95 Branch
See Also: → 1814551
You need to log in before you can comment on or make changes to this bug.

Attachment

General

Created:
Updated:
Size: