Fork server causes shutdown hangs on CI by creating zombies
Categories
(Core :: IPC, defect, P1)
Tracking
| | Tracking | Status |
|---|---|---|
| firefox126 | --- | fixed |
People
(Reporter: jld, Assigned: jld)
Attachments
(1 file)
Several things are going on here:
- The fork server can exit while some of its child processes are still running; this causes those processes to be reparented to pid 1.
- In CI test runs, the container's pid 1 is the `run-task` script, which does not `wait` for orphan processes as `init` normally does, so they accumulate in state `Z` (zombie).
- When we probe the processes with `kill(pid, 0)` to see if they exited, zombie processes are seen as still “alive” (the call succeeds, because the pid is still assigned to a process).
- Even when we try to `SIGABRT` or `SIGKILL` the process, that has no effect because it's no longer running, so we block forever on the parent process I/O thread, and some other watchdog timer fires and ends the test run with an error.
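The zombie behaviour described in the list above can be reproduced in a few lines. This is a minimal Python sketch, assuming a Linux system; the function names `probe` and `zombie_probe_demo` are made up for illustration and are not taken from the Firefox source.

```python
import os


def probe(pid):
    """Return True if kill(pid, 0) succeeds, i.e. the pid is still assigned."""
    try:
        os.kill(pid, 0)
    except ProcessLookupError:
        return False  # ESRCH: no process has this pid
    except PermissionError:
        return True  # EPERM: the pid exists but belongs to another user
    return True


def zombie_probe_demo():
    """Probe a child while it is a zombie, then again after reaping it."""
    pid = os.fork()
    if pid == 0:
        os._exit(0)  # child: exit immediately
    # Block until the child has exited, but leave it unreaped (state Z).
    os.waitid(os.P_PID, pid, os.WEXITED | os.WNOWAIT)
    seen_alive_as_zombie = probe(pid)
    os.waitpid(pid, 0)  # reap it, as a normal init would
    seen_alive_after_reap = probe(pid)
    return seen_alive_as_zombie, seen_alive_after_reap
```

On a typical Linux system `zombie_probe_demo()` should return `(True, False)`: the probe only reports the process as gone once something has waited for it, which is exactly the step that never happens when pid 1 does not reap orphans.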
So, on NS_FREE_PERMANENT_DATA builds on CI, shutdown can hang if the fork server is enabled. This is easily reproducible on Try for TSan builds (seemingly 100% on -spi-nw runs because of the socket process, otherwise ~50% of test jobs in the mochitest-plain suite, possibly all because of the RDD process but I haven't checked exhaustively); I thought I'd tested on regular debug builds and didn't see it, but maybe I haven't run enough tests yet. In theory this could also affect regular opt builds, because we'll SIGKILL the process but then try to wait for it to exit, and see above.
What it wouldn't affect is use on normal systems with a normal init, because the zombies will be collected, and hopefully no other process will have been assigned the same pid, and then we can continue. But, as long as it breaks on CI, it blocks rolling out to Nightly.
Options include:
- Changing CI to use something like tini to act as `init` and collect the zombies like on a normal system. This introduces a certain amount of risk (and it's possible that we have a reason for not doing it already).
- Add code in the fork server case of `IsProcessDead` to parse `/proc/{pid}/stat` or similar to check for `Z` state and consider that as dead.
- Ensure that the fork server outlives its child processes. This is of interest for bug 1752638, although it raises some questions about what to do if the fork server crashes.
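To make the second option concrete, here is a rough Python sketch of a liveness check that treats zombies as dead. It assumes Linux with `/proc` mounted; `stat_state` and `is_process_dead` are hypothetical names used only for this illustration, and the real `IsProcessDead` is C++ code whose details may differ.

```python
import os


def stat_state(stat_text):
    """Extract the one-letter state field from /proc/<pid>/stat contents.

    The second field (the command name) is wrapped in parentheses and may
    itself contain spaces or ')' characters, so split after the last ')'.
    """
    after_comm = stat_text[stat_text.rindex(")") + 1:]
    return after_comm.split()[0]


def is_process_dead(pid):
    """Treat a process as dead if its pid is unassigned or it is a zombie."""
    try:
        os.kill(pid, 0)
    except ProcessLookupError:
        return True  # the pid is not assigned to any process
    except PermissionError:
        pass  # the pid exists but belongs to another user; check its state
    try:
        path = f"/proc/{pid}/stat"
        with open(path, encoding="utf-8", errors="replace") as f:
            return stat_state(f.read()) == "Z"
    except (FileNotFoundError, ProcessLookupError):
        return True  # the process vanished between the probe and the read
```

Splitting after the last `)` matters because the command name is not escaped in `/proc/{pid}/stat`; a naive whitespace split would misread the state for a process whose name contains spaces.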
Comment 1•2 years ago
AIUI, we're migrating off Docker for tests, into VMs, which presumably do have an init, so eventually the problem will go away. But yeah, ideally we'd have an init above run-task, or at least a process reaper. Technically speaking, run-task could do that...
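For illustration, the core of a process reaper is small. The sketch below is hypothetical and is not `run-task`'s actual code; it assumes it runs in the process that orphans are reparented to (pid 1 in the container), where they show up as ordinary children, and `reap_exited_children` is an invented name.

```python
import os


def reap_exited_children():
    """Collect every already-exited child without blocking.

    Returns a list of (pid, raw_wait_status) pairs for the zombies reaped.
    """
    reaped = []
    while True:
        try:
            pid, status = os.waitpid(-1, os.WNOHANG)
        except ChildProcessError:
            break  # this process has no children at all
        if pid == 0:
            break  # children exist, but none of them has exited yet
        reaped.append((pid, status))
    return reaped
```

A reaper like this would typically be called from a `SIGCHLD` handler or a periodic loop. One design caveat: `waitpid(-1, ...)` also collects children that the script started itself, so code that later waits for a specific pid would need to be told about statuses reaped here.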
Comment 2•2 years ago (Assignee)
This patch checks whether child processes are in the zombie state
(exited but not yet waited for), by parsing /proc/{pid}/stat, and
treats them as dead in that case.
Child processes can end up stuck in zombie state if the fork server
exits when child processes are still running, which causes them to be
reparented to pid 1, and pid 1 isn't acting as init, which can happen
in container environments like Docker depending on configuration. In
particular, this is currently the case in the containers used by Mozilla
CI to run tests. Without this patch, if the fork server is enabled, we
wait forever for the process to exit and then (in the Mozilla CI case)
some other timeout fires and causes the test run to fail.
Comment 4•2 years ago (bugherder)