Closed Bug 861850 Opened 11 years ago Closed 8 years ago

Consider upping the per-test timeout value for b2g mochitests

Categories

(Testing :: Mochitest, defect)

x86
macOS
defect
Not set
normal

Tracking

(Not tracked)

RESOLVED WONTFIX

People

(Reporter: martijn.martijn, Unassigned)

References

()

Details

Attachments

(1 file, 2 obsolete files)

Some of the mochitests on the b2g boards seem to time out, just because they take too much time to actually run.
See bug 860748, comment 15, where I noticed that dom/encoding/test/test_stringencoding.html was passing on tryserver 1 time, but the 2nd time it didn't, because it took a bit longer then.

So that could be solved by increasing the per-test timeout value for these b2g mochitests.
Or perhaps there is a way of increasing the timeout value only for tests that are known to take a long time?
Blocks: 862196
Attached patch patch (obsolete) — Splinter Review
Jonathan, is this something to consider?
This bascially changes the per-test timeout into timing out when the test isn't spewing out any test-passes anymore for a while.
Attachment #738232 - Flags: review?(jgriffin)
Comment on attachment 738232 [details] [diff] [review]
patch

Review of attachment 738232 [details] [diff] [review]:
-----------------------------------------------------------------

This timeout is actually managed on the Python side, in this block of code:  http://mxr.mozilla.org/mozilla-central/source/testing/mochitest/runtests.py#747.  We wouldn't want to introduce a separate timeout mechanism in the JS.

We don't want to change the timeout for non-B2G tests, but you could change options.timeout in runtestsb2g.py.
Attachment #738232 - Flags: review?(jgriffin) → review-
options.timeout from runtests.py/runtestsb2g.py is passed along to JS here, afaik:
http://mxr.mozilla.org/mozilla-central/source/testing/mochitest/tests/SimpleTest/setup.js#70
Then it's handled by TestRunner.timeout.

I'm not sure what you mean with introducing a separate timeout mechanism in the JS. I'm still using the timeout value from runtests.py, but instead of assuming that the test file is finished within that timeout period, it looks if the test file is still carrying out tests within that timeout period. If it is, it can go on, if not, the timeout code is executed.
That seems a lot more robust to me than assuming that a test file is finished with its tests within 5 minutes (which is the current value), which will not be the case on slower machines with test files that has thousands of tests in it.
It depends on the kind of timeout you're trying to catch.  For example, in one of your try jobs at https://tbpl.mozilla.org/?tree=Try&rev=077b9e7f7d51, this timeout actually occurred because the B2G process crashed.

I think just about all of the timeout failures I've seen include "application timed out after 330 seconds with no output", which is generated by the testrunner, and wouldn't be affected by your patch.

Have you seen cases where a test is producing output, but times out?  If so, your patch might help.
Attached patch patch pushed to try (obsolete) — Splinter Review
This is the patch that I pushed to try: https://tbpl.mozilla.org/?tree=Try&rev=4d2b426d5db4
A lot of the tests that were giving me problems in order to bug 860748, I removed in the b2g.json list. So with this patch, they shouldn't time out anymore.
Tryserver gives this fail in chunk 1:
11:23:31 WARNING - TEST-UNEXPECTED-FAIL | /tests/content/base/test/test_xhr_progressevents.html | application timed out after 330 seconds with no output

That seems like a valid time out, according to the full log:
11:18:01     INFO -  received 'close' command
11:18:01     INFO -  received 'open' command
11:18:02     INFO -  received 'send' command
11:23:31     INFO -  ked-text[2]
11:23:31     INFO -  38149 INFO TEST-PASS | /tests/content/base/test/test_xhr_progressevents.html | event.loaded increased while running data for without length, as xml for moz-chunked-text[2]
11:23:31     INFO -  38150 INFO TEST-PASS | /tests/content/base/test/test_xhr_progressevents.html | event.loaded didn't increase too much while running data for without length, as xml for moz-chunked-text[2]
11:23:31     INFO -  38151 INFO TEST-PASS | /tests/content/base/test/test_xhr_progressevents.html | response strictly grew while running data for without length, as xml for moz-chunked-text[2]
11:23:31     INFO -  38152 INFO TEST-PASS | /tests/content/base/test/test_xhr_progressevents.html | sanity check for progress while running data for without length, as xml for moz-chunked-text[2]
11:23:31     INFO -  38153 INFO TEST-PASS | /tests/content/base/test/test_xhr_progressevents.html | new data matches expected while running data for without length, as xml for moz-chunked-text[2]
11:23:31     INFO -  38154 INFO TEST-PASS | /tests/content/base/test/test_xhr_progressevents.html | lengthComputable while running data for without length, as xml for moz-chunked-text[2]
11:23:31     INFO -  38155 INFO TEST-PASS | /tests/content/base/test/test_xhr_progressevents.html | should have consumed the expected result
11:23:31     INFO -  38156 INFO TEST-PASS | /tests/content/base/test/test_xhr_progressevents.html | event type while running data for without length, as xml for moz-chunked-text[3]
11:23:31     INFO -  38157 INFO TEST-PASS | /tests/content/base/test/test_xhr_progressevents.html | response should be a string while running data for without length, as xml for moz-chunked-text[3]
11:23:31     INFO -  38158 INFO TEST-PASS | /tests/content/base/test/test_xhr_progressevents.html | reflexivity should hold while running data for without length, as xml for moz-chunked-text[3]
11:23:31     INFO -  38159 INFO TEST-PASS | /tests/content/base/test/test_xhr_progressevents.html | event.loaded grew by response size while running data for without length, as xml for moz-chunked-text[3]
11:23:31     INFO -  38160 INFO TEST-PASS | /tests/content/base/test/test_xhr_progressevents.html | event.loaded increased while running data for without length, as xml for moz-chunked-text[3]
11:23:31     INFO -  38161 INFO TEST-PASS | /tests/content/base/test/test_xhr_progressevents.html | event.loaded didn't increase too much while running data for without length, as xml for moz-chunked-text[3]
11:23:31     INFO -  38162 INFO TEST-PASS | /tests/content/base/test/test_xhr_progressevents.html | response strictly grew while running data for without length, as xml for moz-chunked-text[3]
11:23:31     INFO -  38163 INFO TEST-PASS | /tests/content/base/test/test_xhr_progressevents.html | sanity check for progress while running data for without length, as xml for moz-chunked-text[3]
11:23:31     INFO -  38164 INFO TEST-PASS | /tests/content/base/test/test_xhr_progressevents.html | new data matches expected while running data for without length, as xml for moz-chunked-text[3]
11:23:31     INFO -  38165 INFO TEST-PASS | /tests/content/base/test/test_xhr_progressevents.html | lengthComputable while running data for without length, as xml for moz-chunked-text[3]
11:23:31     INFO -  38166 INFO TEST-PASS | /tests/content/base/test/test_xhr_progressevents.html | should have consumed the expected result
11:23:31  WARNING -  TEST-UNEXPECTED-FAIL | /tests/content/base/test/test_xhr_progressevents.html | application timed out after 330 seconds with no output

Anyway, this test file is already in the tree and the patch shouldn't be of any influence here.
Btw, some of the text in the log seems to get cut off with this test file running.


More worrisome is the failure in chunk 3 of the various machines (Ubuntu, Windows and Mac) at:
11:07:04 INFO - 2912 ERROR TEST-UNEXPECTED-FAIL | /tests/dom/indexedDB/ipc/test_ipc.html | /tests/dom/indexedDB/test/test_index_object_cursors.html | Test timed out.
..for instance.
But those seem to be known time outs, afaict, it seems to be bug 860714. I don't know how my patch would affect this negatively.
It's weird that these timeouts seem to occur, there are test-passes going on just before the time out appears, but it is exactly 5 minutes when the first Test timed out when test_ipc.html has first appeared.

test_Range-set.html seems to cause a real crash on the b2g board, so that test file needs definitely stay into the b2g.json file.

test_Range-cloneContents.html, test_Range-deleteContents.html and test_Range-extractContents.html have regular failures (no time outs).

I've done a retrigger of these problematic chunks, to see if they happen again.
Retrigger of B2G Arm (VM) Mochitest 3 opt requested.
Retrigger of B2G Arm (VM) Mochitest 1 opt requested.
This is setting the timeout to a very high value, because this test doesn't ever want to time out, it says.
http://mxr.mozilla.org/mozilla-central/source/dom/indexedDB/ipc/test_ipc.html?force=1#15
SimpleTest.requestLongerTimeout(100)

And apparently, it's overriding the TestRunner.log method.
http://mxr.mozilla.org/mozilla-central/source/dom/indexedDB/ipc/test_ipc.html?force=1#56
TestRunner.log = sendTestMessage;

This stuff could very well be influenced by my patch.

But I don't know how that file works exactly, I still would think that this ok() call would cal the correct Testrunner.log() function: 
http://mxr.mozilla.org/mozilla-central/source/dom/indexedDB/ipc/test_ipc.html?force=1#79

I guess when the TestRunner.log function is overridden, the old one needs to be called in the new one, to get it working.
Retriggered chunk 1 has passed now.
Retriggered chunk 3 gave a tested timed out in test_stringencoding.html, so that one should remain in the exclude list for now, I think.

So if the test_ipc.html test can be fixed, then these test files could be put out of the exclude list:
-    "content/html/content/test/forms/test_input_sanitization.html":"",
finished in 795202ms
-    "dom/tests/mochitest/ajax/jquery/test_jQuery.html":"",
finished in 331496ms

test_Range-surroundContents.html, test_Range-mutations.html, test_addRange.html", test_Range-insertNode.html were in the problematic chunk 3.

test_addRange.html finished once in 691724ms, the other time it seemed to crash in that run. test_Range-mutations.html finished once in 434945ms. test_Range-insertNode.html finished once in 552674ms.

The other time in chunk 3, there was a crash in test_Range-set.html.

So probably better to keep those range tests in the exclude list for now.
Blocks: 862485
Attached patch patchSplinter Review
Ok, this is a new patch that seems to fix the test_ipc.html timeouts (at least, it does on my Macbook).
https://tbpl.mozilla.org/?tree=Try&rev=1e77e906de92

In TestRunner.runNextTest, I had to add another TestRunner._lastLogTime = TestRunner._currentTestStartTime to get it working. This makes sense, but since I don't know how test_ipc.html is working exactly, I can't really tell how it is fixing this time out failure.
I'm still confused on how test_ipc.html is just sitting silently for minutes before it starts sending out results. I guess the ipc message system is blocked by the constant running of mochitests in the child iframe or something?

In any case, I still think test_ipc.html could use some fixing of the TestRunner.log thing. But that is probably better done in a new bug.

The only 2 tests that can be removed from the exclude list for now are these 2:
-    "content/html/content/test/forms/test_input_sanitization.html":"",
-    "dom/tests/mochitest/ajax/jquery/test_jQuery.html":"",
Attachment #738232 - Attachment is obsolete: true
Attachment #738584 - Attachment is obsolete: true
Comment on attachment 739172 [details] [diff] [review]
patch

Tryserver all green for the relevant parts, so this patch seems to work very nicely.
Attachment #739172 - Flags: review?(jgriffin)
Comment on attachment 739172 [details] [diff] [review]
patch

The code looks fine.  I'm not sure we want to globally turn off our slow test detection, which is essentially what this does.  I assume we wanted it for a reason.

Another option would be just to dramatically increase the global timeout for mochitests on B2G.  Ted, what do you think?
Attachment #739172 - Flags: feedback?(ted)
I wouldn't say the patch is turning off slow test detection, I think it's improving it. Currently, a test file with thousands of tests in it, could be timed out, even though it would be running fine.
Comment on attachment 739172 [details] [diff] [review]
patch

Review of attachment 739172 [details] [diff] [review]:
-----------------------------------------------------------------

Overall the changes to the TestRunner.js look good, the other alternative is to use the --timeout cli flag to lengthen the timeout, this might affect that.  Could we ensure that our timeout is a valid value after making this change?  It seems that we might be able to go with a lower timeout!
I'm not exactly sure what comment 14 is asking for.
I tested b2g mochitest dom/tests/mochitest/ajax/jquery/test_jQuery.html with timeout cli flag of 6 and it timed out without the patch. With the patch, it didn't time out anymore.
Blocks: 789265
Attachment #739172 - Flags: review?(ted)
Blocks: 782720
Comment on attachment 739172 [details] [diff] [review]
patch

Review of attachment 739172 [details] [diff] [review]:
-----------------------------------------------------------------

I'm not sure how I feel about this patch. It's really changing how we determine test timeouts, which has the potential to make some tests run a lot longer. I think I'd be a lot happier if we could just bump the global timeout for B2G instead. I really don't think we should be simply changing the meaning of timeout from "total time a test has run without finishing" to "time since the test last produced output".
Attachment #739172 - Flags: feedback?(ted)
Ok, what would be a good value for the test file timeout on B2G? Or should perhaps the tests that are currently running too long be split into smaller test files?
And disable the test file timeout for files that can't be split into smaller test files (like test_ipc.html)?
I don't know, we should figure out what a reasonable value is.

There's also a requestLongerTimeout function we can use if specific tests are taking too long.
(In reply to Ted Mielczarek [:ted.mielczarek] from comment #16)
> Comment on attachment 739172 [details] [diff] [review]
> patch
> 
> Review of attachment 739172 [details] [diff] [review]:
> -----------------------------------------------------------------
> 
> I'm not sure how I feel about this patch. It's really changing how we
> determine test timeouts, which has the potential to make some tests run a
> lot longer. I think I'd be a lot happier if we could just bump the global
> timeout for B2G instead. I really don't think we should be simply changing
> the meaning of timeout from "total time a test has run without finishing" to
> "time since the test last produced output".

I think if it were attached to a different command-line option with a different code path, rather than flatly redefining what runtime (and thus timeout) means, it might well make sense for some runs. 

A run against a very slow target--particularly with possibly broken tests--would probably be better served by this style rather than just upping the global timeout. Best case might even be a combo, where timeout since last output is used to watchdog hung tests, and global timeout is used to watchdog looping or unreasonably slow tests.

Just a thought. Changing something globally in a way that affects other existing test runs may not be ideal, but I really like the idea of this behavior as a tool that can be optionally enabled.
It's true, and I thought about that, but we do already have a per-test timeout as well as a "time without output" timeout in the JS and Python harnesses, respectively, so adding a new one seemed like it would make things even more complicated.
Comment on attachment 739172 [details] [diff] [review]
patch

Can we just try adding SimpleTest.requestLongerTimeout(3) (or more) to the affected tests?
Attachment #739172 - Flags: review?(jgriffin)
Status: NEW → RESOLVED
Closed: 8 years ago
Resolution: --- → WONTFIX
You need to log in before you can comment on or make changes to this bug.

Attachment

General

Created:
Updated:
Size: