Closed
Bug 957348
Opened 10 years ago
Closed 10 years ago
Android browser hangs produce randomly truncated test logs
Categories
(Testing :: General, defect)
Tracking
(Not tracked)
RESOLVED
FIXED
People
(Reporter: gbrown, Assigned: gbrown)
Details
Attachments
(2 files)
3.55 KB,
patch
|
armenzg
:
review+
|
Details | Diff | Splinter Review |
1.92 KB,
patch
|
armenzg
:
review+
|
Details | Diff | Splinter Review |
If Firefox for Android hangs during a test (certainly a mochitest, but I think I have seen this with a reftest also), the test fails with "2400 seconds without output" (seems reasonable), but the test log is truncated, seemingly randomly. Here is a try run with an intentional hang during shutdown: https://tbpl.mozilla.org/?tree=Try&rev=71d867ab90f0 Note that all tests should have completed, but none of the logs show that the tests have completed. Instead we see: 34182 INFO TEST-PASS | /tests/content/base/test/test_bug820909.html | Input types should be ASCII case-insensitive 34183 INFO TEST-PASS | /tests/content/base/test/test_bug820909.html | Checking whether an attribute i command timed out: 2400 seconds without output, attempting to kill process killed by signal 9 program finished with exit code -1 elapsedTime=2786.626790 or 14:44:27 INFO - 34168 INFO TEST-PASS | /tests/content/base/test/test_bug813919.html | startContainer should be div. 14:44:27 INFO - 34169 INFO TEST-PASS | /tests/content/base/test/test_bug813919.html | Range should be collaped to the end of the div element. 14:44:27 INFO - 34170 INFO TEST-INFO | MEMORY STAT vsize after test: 809488384 command timed out: 2400 seconds without output, attempting to kill process killed by signal 9 program finished with exit code -1 elapsedTime=3821.347764 I suppose one of our stages of log-buffering is failing to flush after a period of inactivity, but I cannot seem to find which one.
Assignee | ||
Comment 1•10 years ago
|
||
https://tbpl.mozilla.org/?tree=Try&rev=6c40104c9ae7 adds debug logging to the harness on each pullFile, which of course avoids the 2400 seconds without output failure, transforming the failures into "application ran for longer than allowed maximum time". Now the full log is displayed (why?) and the dated entries for Android 4.0 show a 20+ minute delay before the last chunk of log is seen by MultiFileLogger: 18:04:30 INFO - 34167 INFO TEST-PASS | /tests/content/base/test/test_bug820909.html | CSS unit names should be ASCII case-insensitive 18:04:30 INFO - 34168 INFO TEST-PASS | /tests/content/base/test/test_bug820909.html | CSS property aliases should be ASCII case-insensitive 18:04:30 INFO - 34169 INFO TEST-PASS | /tests/content/base/test/test_bug820909.html | Inputs that aren't actually type='hidden' should not be allowed as table kids 18:04:30 INFO - 34170 INFO TEST-PASS | /tests/content/base/test/test_bug820909.html | CSS property names should be ASCII case-insensitive 18:30:36 INFO - 34171 INFO TEST-PASS | /tests/content/base/test/test_bug820909.html | CSS keywords should be ASCII case-insensitive 18:30:36 INFO - 34172 INFO TEST-PASS | /tests/content/base/test/test_bug820909.html | CSS color names should be ASCII case-insensitive 18:30:36 INFO - 34173 INFO TEST-INFO | MEMORY STAT vsize after test: 809062400 18:30:36 INFO - 34174 INFO TEST-INFO | MEMORY STAT residentFast after test: 210219008 18:30:36 INFO - 34175 INFO TEST-INFO | MEMORY STAT heapAllocated after test: 33437588 18:30:36 INFO - 34176 INFO TEST-END | /tests/content/base/test/test_bug820909.html | finished in 352ms 18:30:36 INFO - 34177 INFO TEST-START | Shutdown 18:30:36 INFO - 34178 INFO Passed: 32534 18:30:36 INFO - 34179 INFO Failed: 0 18:30:36 INFO - 34180 INFO Todo: 63 18:30:36 INFO - 34181 INFO Slowest: 53107ms - /tests/content/base/test/test_bug682592.html 18:30:36 INFO - 34182 INFO SimpleTest FINISHED I do not understand what's happening here.
Comment 2•10 years ago
|
||
I don't see this timestamped logs in the try-server push. Are we failing to terminate the process? It seems to be the case- maybe we can force this to terminate the test harness when we receive 'SimpleTest FINISHED'
Assignee | ||
Comment 3•10 years ago
|
||
(In reply to Joel Maher (:jmaher) from comment #2) > I don't see this timestamped logs in the try-server push. Are we failing to > terminate the process? It seems to be the case- maybe we can force this to > terminate the test harness when we receive 'SimpleTest FINISHED' We kill the runtestsremote.py process after "2400 seconds without output" -- I think that's okay. Killing after "SimpleTest FINISHED" won't address the case of a hang mid-test. (I'm using a shutdown hang here as an example, but the randomly truncated logs affect all Android hangs.)
Assignee | ||
Comment 4•10 years ago
|
||
I understand what the problem is now. We have a parent process - like mozharness' android_panda_unittest.py - calling "python runtestsremote.py ...", collecting the runtestsremote output and then logging that. "python runtestsremote.py"'s output is buffered, and if the parent process does not see any output in 2400 seconds, it kills the child, and loses any buffered output. We can avoid the problem by forcing standard output to be flushed in runtestsremote.py, as seen here: https://tbpl.mozilla.org/?tree=Try&rev=aedaf11e70fc. But I think the better solution is to call "python -u runtestsremote.py ...". We seem to use "python -u" for all of the desktop unittests, and they do have the problem of randomly truncated logs: https://tbpl.mozilla.org/?tree=Try&rev=cd5d901f706a. Does anyone recall why we do not use the -u (unbuffered) flag for Android tests? Just an oversight?
Assignee | ||
Comment 5•10 years ago
|
||
Sigh.
> We seem to use "python -u"
> for all of the desktop unittests, and they do have the problem of randomly truncated logs
Of course I meant:
We seem to use "python -u"
for all of the desktop unittests, and they do NOT have the problem of randomly truncated logs
Comment 6•10 years ago
|
||
(In reply to Geoff Brown [:gbrown] from comment #4) > I understand what the problem is now. > > We have a parent process - like mozharness' android_panda_unittest.py - > calling "python runtestsremote.py ...", collecting the runtestsremote output > and then logging that. "python runtestsremote.py"'s output is buffered, and > if the parent process does not see any output in 2400 seconds, it kills the > child, and loses any buffered output. > > We can avoid the problem by forcing standard output to be flushed in > runtestsremote.py, as seen here: > https://tbpl.mozilla.org/?tree=Try&rev=aedaf11e70fc. > > But I think the better solution is to call "python -u runtestsremote.py > ...". We seem to use "python -u" for all of the desktop unittests, and they > do have the problem of randomly truncated logs: > https://tbpl.mozilla.org/?tree=Try&rev=cd5d901f706a. > > > Does anyone recall why we do not use the -u (unbuffered) flag for Android > tests? Just an oversight? I don't personally know. It's worth trying unless someone is aware of any issues with it.
Comment 7•10 years ago
|
||
gbrown++
Assignee | ||
Comment 8•10 years ago
|
||
Attachment #8357942 -
Flags: review?(armenzg)
Assignee | ||
Comment 9•10 years ago
|
||
Attachment #8357943 -
Flags: review?(armenzg)
Updated•10 years ago
|
Attachment #8357942 -
Flags: review?(armenzg) → review+
Updated•10 years ago
|
Attachment #8357943 -
Flags: review?(armenzg) → review+
Comment 10•10 years ago
|
||
Have these patches been tested? I would ike to see 4.0 and x86 running these before landing the 2.2 patches as it is more of a churn to deploy (reconfigs).
Assignee | ||
Comment 11•10 years ago
|
||
Not really. I thought I would check in the mozharness patch and immediately kick off a few ash jobs to check. Reasonable?
Comment 12•10 years ago
|
||
Yep!
Assignee | ||
Comment 13•10 years ago
|
||
https://hg.mozilla.org/build/mozharness/rev/7961de705a7e
Assignee | ||
Comment 14•10 years ago
|
||
Verified no apparent harm on Android 4.0 and Android 4.2 x86 on Cedar. For example https://tbpl.mozilla.org/php/getParsedLog.php?id=32783016&tree=Cedar&full=1.
Comment 15•10 years ago
|
||
Should we land the 2.2 code? We could a reconfig early next week.
Assignee | ||
Comment 16•10 years ago
|
||
I think so -- it should be safe.
Assignee | ||
Comment 17•10 years ago
|
||
https://hg.mozilla.org/build/buildbotcustom/rev/daec665bb52f
Comment 18•10 years ago
|
||
In production
Assignee | ||
Comment 19•10 years ago
|
||
Looks like this is working: https://tbpl.mozilla.org/php/getParsedLog.php?id=32940762&tree=Mozilla-Inbound
Status: NEW → RESOLVED
Closed: 10 years ago
Resolution: --- → FIXED
You need to log in
before you can comment on or make changes to this bug.
Description
•