Closed Bug 1863675 Opened 1 year ago Closed 1 year ago

When mozharness aborts a task due to output timeout, it doesn't terminate

Categories

(Release Engineering :: Applications: MozharnessCore, defect)

defect

Tracking

(firefox121 fixed)

RESOLVED FIXED
Tracking Status
firefox121 --- fixed

People

(Reporter: glandium, Assigned: glandium)

Details

Attachments

(1 file)

See for example bug 1863666. Copy/pasting the end of its log here as an example:

[task 2023-11-08T02:26:56.079Z] 02:26:56     INFO - Automation Error: mozharness timed out after 2400 seconds running ['/builds/worker/.mozbuild/srcdirs/gecko-8a5b87fe5d69/_virtualenvs/common/bin/python', 'mach', '--log-no-times', 'build', '-v']
[taskcluster:error] Task timeout after 10800 seconds. Force killing container.
[taskcluster 2023-11-08 04:02:11.890Z] === Task Finished ===

That is, mozharness stops because of 2400 seconds without output, but then nothing happens for an hour and a half before the task actually fails.

The problem is that we're stuck on the readline in this loop:

                while True:
                    line = p.stdout.readline()
                    if not line:
                        break
                    output_time = time.time()
                    parser.add_lines(line)

And the p.kill that happens in the timer doesn't interrupt it. Funnily enough, attaching strace to the mozharness process makes the read fail and unblocks the whole thing.

(In reply to Geoff Brown [:gbrown] from comment #2)

Does SIGKILL work? Maybe something like https://searchfox.org/mozilla-central/rev/02841791400cf7cf5760c0cfaf31f5d772624253/testing/mozbase/mozprocess/tests/test_run_and_wait.py#13-19 ?

I tried p.terminate() instead of p.kill(), which sends SIGKILL, and no, that doesn't work.

(In reply to Mike Hommey [:glandium] from comment #3)

(In reply to Geoff Brown [:gbrown] from comment #2)

Does SIGKILL work? Maybe something like https://searchfox.org/mozilla-central/rev/02841791400cf7cf5760c0cfaf31f5d772624253/testing/mozbase/mozprocess/tests/test_run_and_wait.py#13-19 ?

I tried p.terminate() instead of p.kill(), which sends SIGKILL, and no, that doesn't work.

It's the opposite, actually, p.kill() sends SIGKILL, and that's what we already have.

The current code hits a deadlock when the timer thread kills the
subprocess, while the main thread is blocked on reading a line off the
subprocess's output: the readline call never returns, and while
mozharness prints the timeout message, it also stays there blocked, and
workers then wait for the task timeout, which is usually much larger.

The new strategy is to read from a thread, and send the read lines to
the main thread over a Queue, with a timeout.

Pushed by mh@glandium.org: https://hg.mozilla.org/integration/autoland/rev/543c915fc8c6 Change how mozharness handles output_timeout. r=releng-reviewers,gbrown
Status: NEW → RESOLVED
Closed: 1 year ago
Resolution: --- → FIXED
You need to log in before you can comment on or make changes to this bug.

Attachment

General

Created:
Updated:
Size: