Closed Bug 957348 Opened 10 years ago Closed 10 years ago

Android browser hangs produce randomly truncated test logs

Categories

(Testing :: General, defect)

x86
macOS
defect
Not set
normal

Tracking

(Not tracked)

RESOLVED FIXED

People

(Reporter: gbrown, Assigned: gbrown)

Details

Attachments

(2 files)

If Firefox for Android hangs during a test (certainly a mochitest, but I think I have seen this with a reftest also), the test fails with "2400 seconds without output" (seems reasonable), but the test log is truncated, seemingly randomly.

Here is a try run with an intentional hang during shutdown:

https://tbpl.mozilla.org/?tree=Try&rev=71d867ab90f0

Note that all tests should have completed, but none of the logs show that the tests have completed. Instead we see:

34182 INFO TEST-PASS | /tests/content/base/test/test_bug820909.html | Input types should be ASCII case-insensitive
34183 INFO TEST-PASS | /tests/content/base/test/test_bug820909.html | Checking whether an attribute i
command timed out: 2400 seconds without output, attempting to kill
process killed by signal 9
program finished with exit code -1
elapsedTime=2786.626790

or

14:44:27     INFO -  34168 INFO TEST-PASS | /tests/content/base/test/test_bug813919.html | startContainer should be div.
14:44:27     INFO -  34169 INFO TEST-PASS | /tests/content/base/test/test_bug813919.html | Range should be collaped to the end of the div element.
14:44:27     INFO -  34170 INFO TEST-INFO | MEMORY STAT vsize after test: 809488384

command timed out: 2400 seconds without output, attempting to kill
process killed by signal 9
program finished with exit code -1
elapsedTime=3821.347764

I suppose one of our stages of log-buffering is failing to flush after a period of inactivity, but I cannot seem to find which one.
https://tbpl.mozilla.org/?tree=Try&rev=6c40104c9ae7 adds debug logging to the harness on each pullFile, which of course avoids the 2400 seconds without output failure, transforming the failures into "application ran for longer than allowed maximum time". Now the full log is displayed (why?) and the dated entries for Android 4.0 show a 20+ minute delay before the last chunk of log is seen by MultiFileLogger:

18:04:30     INFO -  34167 INFO TEST-PASS | /tests/content/base/test/test_bug820909.html | CSS unit names should be ASCII case-insensitive
18:04:30     INFO -  34168 INFO TEST-PASS | /tests/content/base/test/test_bug820909.html | CSS property aliases should be ASCII case-insensitive
18:04:30     INFO -  34169 INFO TEST-PASS | /tests/content/base/test/test_bug820909.html | Inputs that aren't actually type='hidden' should not be allowed as table kids
18:04:30     INFO -  34170 INFO TEST-PASS | /tests/content/base/test/test_bug820909.html | CSS property names should be ASCII case-insensitive
18:30:36     INFO -  34171 INFO TEST-PASS | /tests/content/base/test/test_bug820909.html | CSS keywords should be ASCII case-insensitive
18:30:36     INFO -  34172 INFO TEST-PASS | /tests/content/base/test/test_bug820909.html | CSS color names should be ASCII case-insensitive
18:30:36     INFO -  34173 INFO TEST-INFO | MEMORY STAT vsize after test: 809062400
18:30:36     INFO -  34174 INFO TEST-INFO | MEMORY STAT residentFast after test: 210219008
18:30:36     INFO -  34175 INFO TEST-INFO | MEMORY STAT heapAllocated after test: 33437588
18:30:36     INFO -  34176 INFO TEST-END | /tests/content/base/test/test_bug820909.html | finished in 352ms
18:30:36     INFO -  34177 INFO TEST-START | Shutdown
18:30:36     INFO -  34178 INFO Passed:  32534
18:30:36     INFO -  34179 INFO Failed:  0
18:30:36     INFO -  34180 INFO Todo:    63
18:30:36     INFO -  34181 INFO Slowest: 53107ms - /tests/content/base/test/test_bug682592.html
18:30:36     INFO -  34182 INFO SimpleTest FINISHED

I do not understand what's happening here.
I don't see this timestamped logs in the try-server push.  Are we failing to terminate the process?  It seems to be the case- maybe we can force this to terminate the test harness when we receive 'SimpleTest FINISHED'
(In reply to Joel Maher (:jmaher) from comment #2)
> I don't see this timestamped logs in the try-server push.  Are we failing to
> terminate the process?  It seems to be the case- maybe we can force this to
> terminate the test harness when we receive 'SimpleTest FINISHED'

We kill the runtestsremote.py process after "2400 seconds without output" -- I think that's okay. Killing after "SimpleTest FINISHED" won't address the case of a hang mid-test. (I'm using a shutdown hang here as an example, but the randomly truncated logs affect all Android hangs.)
I understand what the problem is now.

We have a parent process - like mozharness' android_panda_unittest.py - calling "python runtestsremote.py ...", collecting the runtestsremote output and then logging that. "python runtestsremote.py"'s output is buffered, and if the parent process does not see any output in 2400 seconds, it kills the child, and loses any buffered output.

We can avoid the problem by forcing standard output to be flushed in runtestsremote.py, as seen here: https://tbpl.mozilla.org/?tree=Try&rev=aedaf11e70fc.

But I think the better solution is to call "python -u runtestsremote.py ...". We seem to use "python -u" for all of the desktop unittests, and they do have the problem of randomly truncated logs: https://tbpl.mozilla.org/?tree=Try&rev=cd5d901f706a.


Does anyone recall why we do not use the -u (unbuffered) flag for Android tests? Just an oversight?
Sigh.

> We seem to use "python -u"
> for all of the desktop unittests, and they do have the problem of randomly truncated logs

Of course I meant:

We seem to use "python -u"
for all of the desktop unittests, and they do NOT have the problem of randomly truncated logs
(In reply to Geoff Brown [:gbrown] from comment #4)
> I understand what the problem is now.
> 
> We have a parent process - like mozharness' android_panda_unittest.py -
> calling "python runtestsremote.py ...", collecting the runtestsremote output
> and then logging that. "python runtestsremote.py"'s output is buffered, and
> if the parent process does not see any output in 2400 seconds, it kills the
> child, and loses any buffered output.
> 
> We can avoid the problem by forcing standard output to be flushed in
> runtestsremote.py, as seen here:
> https://tbpl.mozilla.org/?tree=Try&rev=aedaf11e70fc.
> 
> But I think the better solution is to call "python -u runtestsremote.py
> ...". We seem to use "python -u" for all of the desktop unittests, and they
> do have the problem of randomly truncated logs:
> https://tbpl.mozilla.org/?tree=Try&rev=cd5d901f706a.
> 
> 
> Does anyone recall why we do not use the -u (unbuffered) flag for Android
> tests? Just an oversight?

I don't personally know. It's worth trying unless someone is aware of any issues with it.
Attachment #8357942 - Flags: review?(armenzg)
Attachment #8357943 - Flags: review?(armenzg)
Attachment #8357942 - Flags: review?(armenzg) → review+
Attachment #8357943 - Flags: review?(armenzg) → review+
Have these patches been tested?
I would ike to see 4.0 and x86 running these before landing the 2.2 patches as it is more of a churn to deploy (reconfigs).
Not really. I thought I would check in the mozharness patch and immediately kick off a few ash jobs to check. Reasonable?
Yep!
Verified no apparent harm on Android 4.0 and Android 4.2 x86 on Cedar. For example https://tbpl.mozilla.org/php/getParsedLog.php?id=32783016&tree=Cedar&full=1.
Should we land the 2.2 code? We could a reconfig early next week.
I think so -- it should be safe.
In production
Looks like this is working: https://tbpl.mozilla.org/php/getParsedLog.php?id=32940762&tree=Mozilla-Inbound
Status: NEW → RESOLVED
Closed: 10 years ago
Resolution: --- → FIXED
You need to log in before you can comment on or make changes to this bug.

Attachment

General

Created:
Updated:
Size: