Bug 1187038 (Closed): Opened 9 years ago, Closed 7 years ago

Now that normal mochitest output doesn't get logged, the harness incorrectly thinks long tests are hanging & kills them with "timed out after 1000 seconds of no output"

Categories

(Release Engineering :: Applications: MozharnessCore, defect)

defect
Not set
normal

Tracking

(Not tracked)

RESOLVED INCOMPLETE

People

(Reporter: dholbert, Unassigned)

References

Details

The mochitest harness has this feature where it watches the output, and if 1000 seconds go by with *zero output*, it assumes the browser is stuck/hung and kills it.

This 1000s limit is *not* intended to be an upper-limit on a mochitest's runtime.[1]  Rather, it's just a sanity-check for "if this much time has gone by without anything logged at all, then something's probably gone wrong and we better bail." This was traditionally a valid assumption, because even extremely-long tests would be logging their PASS/FAIL results.

HOWEVER -- nowadays, tests' PASS/FAIL results are redirected somewhere else by default, which means that this 1000s-watcher thinks nothing is happening and kills them.

This is bad.

[1] (We have a separate limit for test runtime, and it's configurable on a per-test basis with  SimpleTest.requestLongerTimeout().)

EXPECTED RESULTS: Mochitests that are actively running subtests (such that e.g. "INFO TEST-PASS" lines show up in my local terminal) should not run up against the "timed out after 1000 seconds of no output" limit, unless they actually go 1000 seconds between subtests.

I'm not sure exactly how to accomplish this -- I know INFO TEST-PASS lines are suppressed on automated test runs on TreeHerder, but I don't know exactly how that suppression works. If those lines are still logged *somewhere*, then mozharness should just monitor that "somewhere". Or, if the INFO TEST-PASS logging is completely disabled, then maybe we could update the timestamp on some file each time an INFO TEST-PASS *would* have been logged, and have mozharness watch that file until 1000 seconds have gone by without a timestamp modification.

In any case -- right now we *accidentally* have 1000 seconds as a hard upper-limit on mochitest-runtime, and we should fix that one way or another.
Component: General Automation → Mozharness
QA Contact: catlee → jlund
For reference, here's a Try run with a dummy mochitest that takes more than 1000 seconds to complete (and uses requestLongerTimeout to opt in to the long run time):
 https://treeherder.mozilla.org/#/jobs?repo=try&revision=3fde2acab10e

Note the red M5 runs on Linux Opt (other platforms aren't done yet); those failures are due to this bug. (Linux debug happens to be saved by some background noise appearing as "output" during the 1000 seconds.)

If I add some stupid "dump()" logging, then I can work around this by creating "output":
 https://treeherder.mozilla.org/#/jobs?repo=try&revision=63a46fc4ac28

This demonstrates that the "dump()" output is sufficient to hack around the 1000s limit, which is kinda silly.
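
For the record, the hackaround amounts to something like this inside the test (a rough sketch, not the actual Try patch; the property list and the runChecksFor() helper are made up for illustration):

  // Sketch of the "dump()" hackaround: emit something on each iteration so
  // mozharness's output-watcher sees activity and resets its 1000-second
  // no-output counter. dump() still reaches the log even while the normal
  // TEST-PASS output is suppressed.
  var properties = ["color", "display", "width"];  // placeholder list
  properties.forEach(function(prop) {
    dump("still alive, testing " + prop + "\n");
    runChecksFor(prop);  // hypothetical stand-in for the real subtests
  });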
This isn't a theoretical problem -- this appears to be what's causing reds in test_extra_inherit_initial.html on B2G debug (bug 1168222).
Blocks: 1168222
Yeah, stdout is a stupid way to figure out if something is still running or not. But iirc, this is actually a buildbot limitation (the mozharness timeout exists so that we can terminate the job gracefully instead of being hard killed by buildbot).

Maybe we can get mozharness to use the structured log to determine if the job is still running, and if it is, print a '.' or something every couple of minutes. This might also be fixed by switching to taskcluster, but if this is causing problems right now, we shouldn't block on the migration.
The 1000 second timeout is hardcoded in mozharness, why don't we just bump that above the actual upper limit for a legitimate test file duration?
(In reply to Chris Manchester [:chmanchester] from comment #5)
> The 1000 second timeout is hardcoded in mozharness, why don't we just bump
> that above the actual upper limit for a legitimate test file duration?

There is no "actual upper limit for a legitimate test file duration". Tests default to failing if they exceed 330 seconds (IIRC), but that's configurable on a per-test basis, inside the test, as noted at the end of comment 0.

The test where we're hitting this in real life, test_extra_inherit_initial.html, uses requestLongerTimeout(4), which gets it a limit of 1320 seconds. But it could just as well be using requestLongerTimeout(5) for 1650 seconds (and it's conceivable that it might need to someday, if we add a new/slower tier-1 test platform).
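
For illustration, the opt-in is just a single call near the top of the test (sketch only; the multiplier is whatever the test needs):

  // Sketch: requestLongerTimeout(N) multiplies the default per-test budget
  // (~330 seconds) by N -- so 4 gives ~1320 s and 5 would give ~1650 s.
  // It does nothing for mozharness's separate 1000 s no-output watcher.
  SimpleTest.requestLongerTimeout(4);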

So, unless this mozharness 1000-sec timeout can actually be monitoring some useful measure of test-activity (per comment 4), it ends up acting as a second upper limit on test duration, which is unfortunate & awkward.
FWIW, we can hack around this on a one-off basis by adding periodic "dump()" lines; this produces "output" that mozharness can see.  (My Try runs in comment 2 demonstrate this hackaround, for example.)

Probably not a good thing to depend on indefinitely, but it might be the best short-term fix (depending on how evil/hacky the "." solution from comment 4 would be) until we have a better solution via taskcluster.
(In reply to Daniel Holbert [:dholbert] from comment #7)
> FWIW, we can hack around this on a one-off basis by adding periodic "dump()"
> lines; this produces "output" that mozharness can see.  (My Try runs in
> comment 2 demonstrate this hackaround, for example.)
> 
> Probably not a good thing to depend on indefinitely, but it might be the
> best short-term fix (depending on how evil/hacky the "." solution from
> comment 4 would be) until we have a better solution via taskcluster.

SimpleTest.requestCompleteLog() circumvents output suppression for a single test; that might also work as a short-term fix.
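
Roughly (a sketch, assuming the test can tolerate the extra log volume):

  // Sketch: opt this one test out of log suppression, so its TEST-PASS lines
  // get logged again and (presumably) count as "output" for the mozharness
  // watcher.
  SimpleTest.requestCompleteLog();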
Ah, thanks.

(That'd probably cause too much log-bloat in this case, since this test does a *lot* of sub-tests -- I count 53256 instances of "TEST-PASS" from a local mochitest run with just this testcase.  In contrast, one dump("hello") per CSS property (in the test's outer loop) would get us a few hundred lines of output, so that's probably a preferable hackaround, if we need it.)
Some kind of maximum runtime is needed to prevent broken jobs from hogging all the machines.

We can disable the maximum-time-without-output and rely on maximum-runtime instead if that would help.
It's not clear how we can move forward here. Please re-open if you think there's something concrete we should do.
Status: NEW → RESOLVED
Closed: 7 years ago
Resolution: --- → INCOMPLETE