Web Processes not reaped leaving zombies with ASan
Categories
(Core :: IPC, defect)
Tracking
()
Tracking | Status | |
---|---|---|
firefox95 | --- | fixed |
People
(Reporter: lukas.bernhard, Assigned: jld)
References
Details
Attachments
(1 file)
User Agent: Mozilla/5.0 (X11; Linux x86_64; rv:88.0) Gecko/20100101 Firefox/88.0
Steps to reproduce:
Firefox Nighly ASAN with Fission enabled:
Opening an arbitrary web page assigns one of the preallocated processes as web process for the page.
Once the page is closed the Web process terminates.
Actual results:
The terminated web process is not reaped, leaving a zombie processes behind. Opening and closing the same page again leaves another zombie, so the number of zombies increases steadily.
This behavior reproduces all the time, leaving me with more than a dozen zombie processes.
strace shows the signal generated by the exiting web process:
[pid 994368] --- SIGCHLD {si_signo=SIGCHLD, si_code=CLD_EXITED, si_pid=996657, si_uid=1000, si_status=0, si_utime=802, si_stime=69} ---
=> PID 994368 receives SIGCHLD but apparently does not call waitpid. This process corresponds to: -{IPC Launch}(994364)
Expected results:
No zombie processes should accumulate.
Comment 1•3 years ago
|
||
The Bugbug bot thinks this bug should belong to the 'Core::DOM: Navigation' component, and is moving the bug to that component. Please revert this change in case you think the bot is wrong.
Updated•3 years ago
|
Comment 2•3 years ago
|
||
Perhaps this is a race in the child reaping that happens due to ASan slowdown?
We should investigate before Fission M8. Nika says the child process reaper behavior is slightly different in ASan mode.
Updated•3 years ago
|
Comment 4•3 years ago
|
||
I'm not going to have time to look into this in the near term.
Tyson, are you aware of any problem like this showing up with ASan Fission fuzzing? I feel like it would show up there.
Comment 5•3 years ago
|
||
I do see processes hanging around from time to time when I check manually. I have no idea what they are from or how to detect them when fuzzing.
Reporter | ||
Comment 6•3 years ago
|
||
I investigated a bit further.
When leaving a site (either due to navigating to a new site or by closing the tab) ProcessWatcher::EnsureProcessTerminated(pid, force=false) is reached.
A ChildLaxReaper
is created and we reach ChildLaxReaper::OnSignal
which in turn calls into ChildReaper::OnSignal
.
The call to IsProcessDead
calls into base::DidProcessCrash()
.
Since the build is configured with MOZ_ENABLE_FORKSERVER, the waitpid call in DidProcessCrash cannot be reached.
However: while MOZ_ENABLE_FORKSERVER is true, the fork-server is actually not used. How come?
The config value "dom.ipc.forkserver.enable" defaults to false (https://searchfox.org/mozilla-central/rev/bd25b6690f53e16e4b2b1aef04acd65ba1ba9b87/modules/libpref/init/StaticPrefList.yaml#2084).
Even in the default would be true, the false case should probably be handled.
Manually setting the pref to true fixes the issue for the Fission and the non-Fission case.
Turns out a self-compiled Firefox leaks zombies even if Fission is disabled (so its not a Fission problem after all).
Instead, the process shutdown logic misses the case of having a build with MOZ_ENABLE_FORKSERVER in combination with a dom.ipc.forkserver.enable=false flag.
Comment 7•3 years ago
|
||
Do you have --enable-forkserver in your MOZCONFIG? It isn't really a supported option. We only have it in the tree because some people are using it on their custom phone OS version of Firefox.
Reporter | ||
Comment 8•3 years ago
|
||
No I don't. My .mozconfig is:
CXX=clang++-11
# Enable ASan specific code and build workarounds
ac_add_options --enable-address-sanitizer
# Mandatory options required for ASan builds (both on Linux and Mac)
export MOZ_DEBUG_SYMBOLS=1
ac_add_options --enable-debug-symbols
ac_add_options --disable-install-strip
ac_add_options --disable-jemalloc
ac_add_options --disable-crashreporter
ac_add_options --disable-profiling
ac_add_options --disable-debug
ac_add_options --enable-optimize="-O2 -gline-tables-only"
ac_add_options --enable-valgrind
ac_add_options --disable-elf-hack
Reporter | ||
Comment 9•3 years ago
|
||
The code gets compiled with MOZ_ENABLE_FORKSERVER=1 though.
Comment 10•3 years ago
|
||
Not a Fission issue, removing Fission milestone tracker.
Thank you Lukas for further investigation to confirm the cause.
Updated•3 years ago
|
Comment 11•3 years ago
|
||
Gabriele, is it expected that we're building on Linux with MOZ_ENABLE_FORKSERVER? That's a little surprising to me. I'm not sure if that's the issue, or if there's something weird going on with a fork server build with the fork server disabled.
Reporter | ||
Comment 12•3 years ago
|
||
Function forkserver_default
in toolkit/moz.configure
returns True for browser builds on Linux and FreeBSD. Ticket #1470591 sets the forkserver as Fission M6 so it might be the new default?
Assignee | ||
Comment 13•3 years ago
|
||
Building the fork server by default but having it preffed off at runtime does seem to be the intent of D61255 (bug 1609881). However, the kill(pid, 0)
code in DidProcessCrash
really shouldn't be used even with the fork server — pids can be reused, so it's unlikely but possible for the check to be inaccurate — and definitely shouldn't be used instead of waitpid
when the fork server isn't active. I was aware of that code, but the current state of child process handling is complicated enough that I didn't realize the full impact.
The other thing going on here is ChildLaxReaper
vs. ChildGrimReaper
: if we're doing any kind of leak checking (MOZ_FREE_PERMANENT_DATA
; this includes ASan builds because of LSan, as well as debug builds for refcount logging), we wait forever for the process to exit rather than killing it after a timeout, and that's different code.
So here's what's going on, as far as I can tell from some local testing:
- On debug builds and ASan builds, every child process becomes a zombie until browser shutdown. Somehow this doesn't cause any test failures (or at least none that have been correctly attributed), which is its own problem.
- On non-ASan opt builds, every child process is a zombie for 2 seconds, then we uselessly send it
SIGKILL
, then we collect it.
I'm in the process of rewriting this code in bug 1658072, but once again the fork server is a problem.
Maybe we should just turn off MOZ_ENABLE_FORKSERVER
until there's a proper fix for this?
Comment 14•3 years ago
|
||
(In reply to Jed Davis [:jld] ⟨⏰|UTC-7⟩ ⟦he/him⟧ from comment #13)
Building the fork server by default but having it preffed off at runtime does seem to be the intent of D61255 (bug 1609881).
Yes, we wanted to eventually also enable it by default on all platforms that supported it. The memory savings it provides are quite good and in the context of Fission it'd be a boon.
I'm in the process of rewriting this code in bug 1658072, but once again the fork server is a problem.
Maybe we should just turn off
MOZ_ENABLE_FORKSERVER
until there's a proper fix for this?
I guess so if there aren't any alternatives. NI?ing Thinker who wrote the forkserver code to see if he can help.
Comment 15•3 years ago
|
||
I'd rather we not build the fork server on desktop platforms when it's not going to be enabled, particularly given that I don't think we're ever going to use the current implementation in Firefox.
Comment 16•3 years ago
|
||
I will work a patch for this before next week.
Comment 17•3 years ago
|
||
More info. Actually, the forkserver process does handle SIGCHLD and call waitpid(). But, sometime it may miss it, I am still not sure how it happen. But, in theory, if more than one SIGCHLD events happen in a very short and the signal handler does handle them in time, it may miss events.
Comment 18•3 years ago
|
||
Call waitpid() repeatly until no more process changed state.
Comment 19•3 years ago
|
||
Can anyone check if this patch fixes the issue here?
Reporter | ||
Comment 20•3 years ago
|
||
Assume one builds Firefox with fork-server enabled (default behavior, unless disabled with ac_add_options --disable-forkserver) but disabled runtime pref (also default). In this case handle_sigchld
is never established as signal handler, as this would require a call to InitForkServerProcess
(which doesn't happen due to the runtime config).
Hence changing handle_sigchld
such that waitpid
is called repeatedly will not resolve the issue. I verified this reasoning by running Firefox in the described configuration with the patch.
Comment 21•3 years ago
|
||
Lukas, Firefox is built with forkserver, but not enable it during runtime, right?
Have you ever disabled forkserver totally? I mean with --disable-forkserverk.
Comment 22•3 years ago
|
||
I have updated the patch to fallback to waitpid() if the pref is off.
Lukas, could you check it? Thanks!
Reporter | ||
Comment 23•3 years ago
|
||
Disabling the forkserver with --disable-forkserverk
prevented the zombies without any patches. I tested the revised patch and can confirm that no zombie processes linger around anymore.
Updated•3 years ago
|
Comment 24•3 years ago
|
||
Pushed by jedavis@mozilla.com: https://hg.mozilla.org/integration/autoland/rev/88ae7d2397eb Handle crash with the old way if the fork server is prefed out. r=jld
Comment 25•3 years ago
|
||
Backed out changeset 88ae7d2397eb (Bug 1697429) for causing bustages in process_util_posix.cc
Log: https://treeherder.mozilla.org/logviewer?job_id=354078810&repo=autoland&lineNumber=13199
Backout: https://hg.mozilla.org/integration/autoland/rev/ef5213a4832e1ccbf3aaee441627662bfd58e543
Assignee | ||
Comment 26•3 years ago
|
||
There was a trivial problem with a #include
path; I fixed it, verified on Try, and I'm relanding.
Comment 27•3 years ago
|
||
Pushed by jedavis@mozilla.com: https://hg.mozilla.org/integration/autoland/rev/0d6116a0dfe3 Handle crash with the old way if the fork server is prefed out. r=jld
Comment 28•3 years ago
|
||
bugherder |
Description
•