Bug 969986 (Open): opened 10 years ago, updated 2 years ago

jittest suite sometimes fails to detect and kill hangs

Categories: Core :: JavaScript Engine (defect)

People: Reporter: philor; Assignee: Unassigned

Since the switch to m3.medium instances for the linux64 test slaves caused linux64 ASan and debug jittests to frequently (or perhaps always) hang, you can see from https://tbpl.mozilla.org/php/getParsedLog.php?id=34017372&tree=Cedar, https://tbpl.mozilla.org/php/getParsedLog.php?id=34017387&tree=Cedar, and https://tbpl.mozilla.org/php/getParsedLog.php?id=34355492&tree=Try that, rather than noticing at any point during the 30 minutes the suite sits hung with no output, we just wait until the 7200 second total job timer expires and kills the step. The proper thing for a suite that wants this to be visible would be to report "TEST-UNEXPECTED-FAIL | tests/jit-test/jit-test/tests/jaeger/bug781859-1.js | Timed out after 330 seconds with no output" rather than the utterly generic "command timed out: 7200 seconds elapsed, attempting to kill", which also wastes 30 or 90 minutes of machine time.
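(For illustration only, a minimal sketch of how a harness-level no-output watchdog could kill a hung child and emit the suggested TEST-UNEXPECTED-FAIL line instead of leaving it to the 7200-second job timer. This is not the real jit_test.py code; the 330-second value and function names are just taken from or invented around the log excerpt above.)

# Illustrative sketch only, not the real harness code: kill a child that has
# produced no output for `no_output_timeout` seconds and report it in the
# suggested TEST-UNEXPECTED-FAIL format.
import subprocess
import threading
import time

def run_with_no_output_watchdog(cmd, test_name, no_output_timeout=330):
    proc = subprocess.Popen(cmd, stdout=subprocess.PIPE,
                            stderr=subprocess.STDOUT, text=True)
    last_output = [time.time()]

    def drain():
        # Forward child output and remember when we last saw any.
        for line in proc.stdout:
            last_output[0] = time.time()
            print(line, end="")

    threading.Thread(target=drain, daemon=True).start()

    while proc.poll() is None:
        if time.time() - last_output[0] > no_output_timeout:
            proc.kill()
            print("TEST-UNEXPECTED-FAIL | %s | Timed out after %d seconds "
                  "with no output" % (test_name, no_output_timeout))
            return False
        time.sleep(1)
    return proc.returncode == 0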
There is a timeout parameter which defaults to 150 seconds, and I'm fairly certain I've seen it work properly in the past. Not sure what went wrong here.
Where does the harness live?
It lives at: js/src/jit-test/jit_test.py.
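(A hedged sketch of how such a per-test wall-clock timeout is typically enforced; this is not the actual jit_test.py implementation, and only the 150-second default is taken from the comment above.)

# Sketch of a per-test wall-clock timeout; not the actual jit_test.py code.
import subprocess

def run_one_test(cmd, timeout=150.0):
    try:
        result = subprocess.run(cmd, capture_output=True, text=True,
                                timeout=timeout)
        return result.returncode, result.stdout
    except subprocess.TimeoutExpired:
        # A timed-out test should be reported by the harness itself instead
        # of being left for the job-level 7200 s timer to clean up.
        return None, "TIMEOUT after %.0f seconds" % timeout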
Component: General Automation → JavaScript Engine
Product: Release Engineering → Core
QA Contact: catlee
Summary: jittest suite needs to detect and kill hangs → jittest suite sometimes fails to detect and kill hangs
I may well have been wrong on my description but right on my choice of product - someone with an m3.medium loaner would need to run the suite to see, but since jit_test.py most certainly does have timeout handling, the failure mode on m3.medium could well be "jit_test.py itself hangs, but we haven't set a particular timeout or no-output-timeout on the step running it, so we just run the whole job out to 7200 seconds."
(In reply to Phil Ringnalda (:philor) from comment #4)
> I may well have been wrong on my description but right on my choice of
> product - someone with an m3.medium loaner would need to run the suite to
> see, but since jit_test.py most certainly does have timeout handling, the
> failure mode on m3.medium could well be "jit_test.py itself hangs, but we
> haven't set a particular timeout or no-output-timeout on the step running
> it, so we just run the whole job out to 7200 seconds."
 
There is a 20 minute no output timeout that I've hit before when playing with the timeout settings (e.g. https://tbpl.mozilla.org/?tree=Try&rev=6f78e053742c&showall=1).

I initially read this as a test harness bug, but if that timer only fails on m3.medium test machines then I'll be happy to move it back to the releng component.
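(For context, a hedged sketch of what setting both a step-level no-output timeout and a total-runtime cap could look like in Buildbot; the command, paths, and values here are placeholders, not the actual releng configuration for this job.)

# Hedged sketch only: command, paths, and values are placeholders.
from buildbot.steps.shell import ShellCommand

jittest_step = ShellCommand(
    name="jittest",
    command=["python", "js/src/jit-test/jit_test.py", "obj/dist/bin/js"],
    timeout=1200,   # no-output timeout: kill the step after 20 idle minutes
    maxTime=7200,   # hard cap on total step runtime
)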
Dunno, it's sort of inconvenient having them currently be unable to start. I also failed to notice that my Try log wasn't hung at all: it started a new test during the very second it was killed for taking 7200 seconds, so that was just a jaw-droppingly slow run.
Blocks: 973900
Has this problem shown up in any recent test runs?
Flags: needinfo?(philringnalda)
No idea: Cedar is a killing field, m3.medium instances are gone, and until yesterday only ASan ran it on try.
Flags: needinfo?(philringnalda)
Severity: normal → S3