Closed Bug 1881386 Opened 2 years ago Closed 2 years ago

Fork server causes shutdown hangs on CI by creating zombies

Categories

(Core :: IPC, defect, P1)

RESOLVED FIXED
126 Branch
Tracking Status
firefox126 --- fixed

People

(Reporter: jld, Assigned: jld)

Attachments

(1 file)

Several things are going on here:

  • The fork server can exit while some of its child processes are still running; this causes those processes to be reparented to pid 1
  • In CI test runs, the container's pid 1 is the run-task script, which does not wait for orphan processes as init normally does, so they accumulate in state Z (zombie)
  • When we probe the processes with kill(pid, 0) to see if they exited, zombie processes are seen as still “alive” (the call succeeds, because the pid is still assigned to a process)
  • Even when we try to SIGABRT or SIGKILL the process, that has no effect because it's no longer running, so we block forever on the parent process I/O thread, and some other watchdog timer fires and ends the test run with an error
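The third point above can be reproduced outside Firefox. The following is a minimal sketch, assuming Linux and Python 3.9+; the helper names `pid_looks_alive` and `probe_zombie` are hypothetical and are not taken from the Gecko code. It forks a child, lets it exit without reaping it, and shows that a signal-0 probe still succeeds until the zombie is collected.

```python
import os


def pid_looks_alive(pid: int) -> bool:
    """Probe with signal 0: True if the pid is still assigned to a process."""
    try:
        os.kill(pid, 0)
    except ProcessLookupError:
        return False  # ESRCH: no such process
    except PermissionError:
        return True  # EPERM: the process exists but we may not signal it
    return True


def probe_zombie() -> tuple[bool, bool]:
    """Return (probe result while zombie, probe result after reaping)."""
    pid = os.fork()
    if pid == 0:
        os._exit(0)  # child: exit immediately

    # Block until the child has exited, but leave it unreaped (WNOWAIT),
    # so it stays in the zombie state.
    os.waitid(os.P_PID, pid, os.WEXITED | os.WNOWAIT)
    while_zombie = pid_looks_alive(pid)

    # Now actually reap it; the pid is released.
    os.waitpid(pid, 0)
    after_reap = pid_looks_alive(pid)
    return while_zombie, after_reap
```

In the scenario described in this bug, the process doing the probing is not the zombie's parent (the child has been reparented to pid 1), so it cannot perform the reaping step itself and the probe keeps reporting the process as alive.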

So, on NS_FREE_PERMANENT_DATA builds on CI, shutdown can hang if the fork server is enabled. This is easily reproducible on Try for TSan builds: seemingly 100% on -spi-nw runs because of the socket process, otherwise ~50% of test jobs in the mochitest-plain suite (possibly all because of the RDD process, but I haven't checked exhaustively). I thought I'd tested on regular debug builds and didn't see it, but maybe I haven't run enough tests yet. In theory this could also affect regular opt builds, because we'll SIGKILL the process but then try to wait for it to exit, and then hit the same zombie problem described above.

What it wouldn't affect is use on normal systems with a normal init, because the zombies will be collected, and hopefully no other process will have been assigned the same pid, and then we can continue. But, as long as it breaks on CI, it blocks rolling out to Nightly.

Options include:

  1. Change CI to use something like tini to act as init and collect the zombies like on a normal system. This introduces a certain amount of risk (and it's possible that we have a reason for not doing it already).
  2. Add code in the fork server case of IsProcessDead to parse /proc/{pid}/stat or similar, check for the Z state, and treat a zombie as dead.
  3. Ensure that the fork server outlives its child processes. This is of interest for bug 1752638, although it raises some questions about what to do if the fork server crashes.
Severity: -- → S3
Priority: -- → P1

AIUI, we're migrating off Docker for tests, into VMs, which presumably do have an init, so eventually the problem will go away. But yeah, ideally we'd have an init above run-task, or at least a process reaper. Technically speaking, run-task could do that...
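For illustration only, the reaping that an init-like pid 1 performs can be sketched as below. This is not run-task's code; `reap_exited_children` is a hypothetical helper, and a real reaper would typically call it from a SIGCHLD handler or a periodic loop.

```python
import os


def reap_exited_children() -> list[int]:
    """Collect every child that has already exited, without blocking."""
    reaped = []
    while True:
        try:
            pid, _status = os.waitpid(-1, os.WNOHANG)
        except ChildProcessError:
            break  # ECHILD: this process has no children at all
        if pid == 0:
            break  # children exist, but none have exited yet
        reaped.append(pid)
    return reaped
```

Because orphans are reparented to pid 1, a loop like this running in pid 1 would also collect processes that pid 1 never started itself, which is what keeps zombies from accumulating on a normal system.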

See Also: → 1883790

This patch checks whether child processes are in the zombie state
(exited but not waited), by parsing /proc/{pid}/stat, and treats
them as dead in that case.
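As a rough sketch of the check described above (not the patch itself, which lives in Gecko's C++ code), the state field can be read from /proc/{pid}/stat as follows. The function names are hypothetical. The one subtlety is that the second field, the command name, is wrapped in parentheses and may itself contain spaces or parentheses, so the sketch splits after the last ')' rather than on whitespace alone.

```python
def state_from_stat(stat_text: str) -> str:
    """Extract the one-letter process state from /proc/<pid>/stat contents."""
    # Layout: "pid (comm) state ppid ...". comm may contain spaces and ')'.
    rparen = stat_text.rindex(")")  # raises ValueError if malformed
    fields = stat_text[rparen + 1:].split()
    if not fields:
        raise ValueError("stat text has no state field")
    return fields[0]


def is_zombie(pid: int) -> bool:
    """True if the process exists and is in state Z (exited, not waited)."""
    try:
        with open(f"/proc/{pid}/stat", encoding="utf-8", errors="replace") as f:
            return state_from_stat(f.read()) == "Z"
    except (FileNotFoundError, ProcessLookupError):
        # The pid is gone entirely; the caller's ordinary liveness probe
        # already covers that case, so it is not reported as a zombie here.
        return False
```

A caller that currently treats a successful kill(pid, 0) as "still alive" could consult a check like `is_zombie(pid)` as well and treat a zombie as dead.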

Child processes can end up stuck in zombie state if the fork server
exits when child processes are still running, which causes them to be
reparented to pid 1, and pid 1 isn't acting as init, which can happen
in container environments like Docker depending on configuration. In
particular, this is currently the case in the containers used by Mozilla
CI to run tests. Without this patch, if the fork server is enabled, we
wait forever for the process to exit and then (in the Mozilla CI case)
some other timeout fires and causes the test run to fail.

Pushed by jedavis@mozilla.com: https://hg.mozilla.org/integration/autoland/rev/fc715d4618d5 Check for zombie processes in ProcessWatcher on Linux. r=glandium
Status: NEW → RESOLVED
Closed: 2 years ago
Resolution: --- → FIXED
Target Milestone: --- → 126 Branch
Regressions: 1888965
No longer regressions: 1888965
See Also: → 1888965
See Also: → 1926763