When mozharness aborts a task due to output timeout, it doesn't terminate
Categories
(Release Engineering :: Applications: MozharnessCore, defect)
Tracking
(firefox121 fixed)
Tracking | Status | |
---|---|---|
firefox121 | --- | fixed |
People
(Reporter: glandium, Assigned: glandium)
Details
Attachments
(1 file)
See for example bug 1863666. Copy/pasting the end of its log here as an example:
[task 2023-11-08T02:26:56.079Z] 02:26:56 INFO - Automation Error: mozharness timed out after 2400 seconds running ['/builds/worker/.mozbuild/srcdirs/gecko-8a5b87fe5d69/_virtualenvs/common/bin/python', 'mach', '--log-no-times', 'build', '-v']
[taskcluster:error] Task timeout after 10800 seconds. Force killing container.
[taskcluster 2023-11-08 04:02:11.890Z] === Task Finished ===
That is, mozharness stops because of 2400 seconds without output, but then nothing happens for an hour and a half before the task actually fails.
Assignee | ||
Comment 1•1 year ago
|
||
The problem is that we're stuck on the readline in this loop:
while True:
line = p.stdout.readline()
if not line:
break
output_time = time.time()
parser.add_lines(line)
And the p.kill that happens in the timer doesn't interrupt it. Funnily enough, attaching strace to the mozharness process makes the read fail and unblocks the whole thing.
Comment 2•1 year ago
|
||
Does SIGKILL work? Maybe something like https://searchfox.org/mozilla-central/rev/02841791400cf7cf5760c0cfaf31f5d772624253/testing/mozbase/mozprocess/tests/test_run_and_wait.py#13-19 ?
Assignee | ||
Comment 3•1 year ago
|
||
(In reply to Geoff Brown [:gbrown] from comment #2)
Does SIGKILL work? Maybe something like https://searchfox.org/mozilla-central/rev/02841791400cf7cf5760c0cfaf31f5d772624253/testing/mozbase/mozprocess/tests/test_run_and_wait.py#13-19 ?
I tried p.terminate() instead of p.kill(), which sends SIGKILL, and no, that doesn't work.
Assignee | ||
Comment 4•1 year ago
|
||
(In reply to Mike Hommey [:glandium] from comment #3)
(In reply to Geoff Brown [:gbrown] from comment #2)
Does SIGKILL work? Maybe something like https://searchfox.org/mozilla-central/rev/02841791400cf7cf5760c0cfaf31f5d772624253/testing/mozbase/mozprocess/tests/test_run_and_wait.py#13-19 ?
I tried p.terminate() instead of p.kill(), which sends SIGKILL, and no, that doesn't work.
It's the opposite, actually, p.kill() sends SIGKILL, and that's what we already have.
Assignee | ||
Comment 5•1 year ago
|
||
The current code hits a deadlock when the timer thread kills the
subprocess, while the main thread is blocked on reading a line off the
subprocess's output: the readline call never returns, and while
mozharness prints the timeout message, it also stays there blocked, and
workers then wait for the task timeout, which is usually much larger.
The new strategy is to read from a thread, and send the read lines to
the main thread over a Queue, with a timeout.
Comment 7•1 year ago
|
||
bugherder |
Description
•