Closed Bug 957348 Opened 12 years ago Closed 12 years ago

Android browser hangs produce randomly truncated test logs

Tracking

(Not tracked)

Status:

RESOLVED FIXED

People

(Reporter: gbrown, Assigned: gbrown)

Details

Attachments

(2 files)

Add -u flag -- Android 2.2 12 years ago Geoff Brown [:gbrown] 3.55 KB, patch	armenzg : review+	Details \| Diff \| Splinter Review
Add -u flag -- Android 4.0, Android x86 12 years ago Geoff Brown [:gbrown] 1.92 KB, patch	armenzg : review+	Details \| Diff \| Splinter Review

Geoff Brown [:gbrown]

Assignee

Description

•

12 years ago

If Firefox for Android hangs during a test (certainly a mochitest, but I think I have seen this with a reftest also), the test fails with "2400 seconds without output" (seems reasonable), but the test log is truncated, seemingly randomly. Here is a try run with an intentional hang during shutdown: https://tbpl.mozilla.org/?tree=Try&rev=71d867ab90f0 Note that all tests should have completed, but none of the logs show that the tests have completed. Instead we see: 34182 INFO TEST-PASS | /tests/content/base/test/test_bug820909.html | Input types should be ASCII case-insensitive 34183 INFO TEST-PASS | /tests/content/base/test/test_bug820909.html | Checking whether an attribute i command timed out: 2400 seconds without output, attempting to kill process killed by signal 9 program finished with exit code -1 elapsedTime=2786.626790 or 14:44:27 INFO - 34168 INFO TEST-PASS | /tests/content/base/test/test_bug813919.html | startContainer should be div. 14:44:27 INFO - 34169 INFO TEST-PASS | /tests/content/base/test/test_bug813919.html | Range should be collaped to the end of the div element. 14:44:27 INFO - 34170 INFO TEST-INFO | MEMORY STAT vsize after test: 809488384 command timed out: 2400 seconds without output, attempting to kill process killed by signal 9 program finished with exit code -1 elapsedTime=3821.347764 I suppose one of our stages of log-buffering is failing to flush after a period of inactivity, but I cannot seem to find which one.

Geoff Brown [:gbrown]

Assignee

Comment 1

•

12 years ago

https://tbpl.mozilla.org/?tree=Try&rev=6c40104c9ae7 adds debug logging to the harness on each pullFile, which of course avoids the 2400 seconds without output failure, transforming the failures into "application ran for longer than allowed maximum time". Now the full log is displayed (why?) and the dated entries for Android 4.0 show a 20+ minute delay before the last chunk of log is seen by MultiFileLogger: 18:04:30 INFO - 34167 INFO TEST-PASS | /tests/content/base/test/test_bug820909.html | CSS unit names should be ASCII case-insensitive 18:04:30 INFO - 34168 INFO TEST-PASS | /tests/content/base/test/test_bug820909.html | CSS property aliases should be ASCII case-insensitive 18:04:30 INFO - 34169 INFO TEST-PASS | /tests/content/base/test/test_bug820909.html | Inputs that aren't actually type='hidden' should not be allowed as table kids 18:04:30 INFO - 34170 INFO TEST-PASS | /tests/content/base/test/test_bug820909.html | CSS property names should be ASCII case-insensitive 18:30:36 INFO - 34171 INFO TEST-PASS | /tests/content/base/test/test_bug820909.html | CSS keywords should be ASCII case-insensitive 18:30:36 INFO - 34172 INFO TEST-PASS | /tests/content/base/test/test_bug820909.html | CSS color names should be ASCII case-insensitive 18:30:36 INFO - 34173 INFO TEST-INFO | MEMORY STAT vsize after test: 809062400 18:30:36 INFO - 34174 INFO TEST-INFO | MEMORY STAT residentFast after test: 210219008 18:30:36 INFO - 34175 INFO TEST-INFO | MEMORY STAT heapAllocated after test: 33437588 18:30:36 INFO - 34176 INFO TEST-END | /tests/content/base/test/test_bug820909.html | finished in 352ms 18:30:36 INFO - 34177 INFO TEST-START | Shutdown 18:30:36 INFO - 34178 INFO Passed: 32534 18:30:36 INFO - 34179 INFO Failed: 0 18:30:36 INFO - 34180 INFO Todo: 63 18:30:36 INFO - 34181 INFO Slowest: 53107ms - /tests/content/base/test/test_bug682592.html 18:30:36 INFO - 34182 INFO SimpleTest FINISHED I do not understand what's happening here.

Joel Maher ( :jmaher ) (UTC -8)

Comment 2

•

12 years ago

I don't see this timestamped logs in the try-server push. Are we failing to terminate the process? It seems to be the case- maybe we can force this to terminate the test harness when we receive 'SimpleTest FINISHED'

Geoff Brown [:gbrown]

Assignee

Comment 3

•

12 years ago

(In reply to Joel Maher (:jmaher) from comment #2) > I don't see this timestamped logs in the try-server push. Are we failing to > terminate the process? It seems to be the case- maybe we can force this to > terminate the test harness when we receive 'SimpleTest FINISHED' We kill the runtestsremote.py process after "2400 seconds without output" -- I think that's okay. Killing after "SimpleTest FINISHED" won't address the case of a hang mid-test. (I'm using a shutdown hang here as an example, but the randomly truncated logs affect all Android hangs.)

Geoff Brown [:gbrown]

Assignee

Comment 4

•

12 years ago

I understand what the problem is now. We have a parent process - like mozharness' android_panda_unittest.py - calling "python runtestsremote.py ...", collecting the runtestsremote output and then logging that. "python runtestsremote.py"'s output is buffered, and if the parent process does not see any output in 2400 seconds, it kills the child, and loses any buffered output. We can avoid the problem by forcing standard output to be flushed in runtestsremote.py, as seen here: https://tbpl.mozilla.org/?tree=Try&rev=aedaf11e70fc. But I think the better solution is to call "python -u runtestsremote.py ...". We seem to use "python -u" for all of the desktop unittests, and they do have the problem of randomly truncated logs: https://tbpl.mozilla.org/?tree=Try&rev=cd5d901f706a. Does anyone recall why we do not use the -u (unbuffered) flag for Android tests? Just an oversight?

Geoff Brown [:gbrown]

Assignee

Comment 5

•

12 years ago

Sigh. > We seem to use "python -u" > for all of the desktop unittests, and they do have the problem of randomly truncated logs Of course I meant: We seem to use "python -u" for all of the desktop unittests, and they do NOT have the problem of randomly truncated logs

Armen [:armenzg]

Comment 6

•

12 years ago

(In reply to Geoff Brown [:gbrown] from comment #4) > I understand what the problem is now. > > We have a parent process - like mozharness' android_panda_unittest.py - > calling "python runtestsremote.py ...", collecting the runtestsremote output > and then logging that. "python runtestsremote.py"'s output is buffered, and > if the parent process does not see any output in 2400 seconds, it kills the > child, and loses any buffered output. > > We can avoid the problem by forcing standard output to be flushed in > runtestsremote.py, as seen here: > https://tbpl.mozilla.org/?tree=Try&rev=aedaf11e70fc. > > But I think the better solution is to call "python -u runtestsremote.py > ...". We seem to use "python -u" for all of the desktop unittests, and they > do have the problem of randomly truncated logs: > https://tbpl.mozilla.org/?tree=Try&rev=cd5d901f706a. > > > Does anyone recall why we do not use the -u (unbuffered) flag for Android > tests? Just an oversight? I don't personally know. It's worth trying unless someone is aware of any issues with it.

Joel Maher ( :jmaher ) (UTC -8)

Comment 7

•

12 years ago

gbrown++

Geoff Brown [:gbrown]

Assignee

Comment 8

•

12 years ago

Attached patch Add -u flag -- Android 2.2 — Details — Splinter Review

Attachment #8357942 - Flags: review?(armenzg)

Geoff Brown [:gbrown]

Assignee

Comment 9

•

12 years ago

Attached patch Add -u flag -- Android 4.0, Android x86 — Details — Splinter Review

Attachment #8357943 - Flags: review?(armenzg)

Armen [:armenzg]

Updated

•

12 years ago

Attachment #8357942 - Flags: review?(armenzg) → review+

Armen [:armenzg]

Updated

•

12 years ago

Attachment #8357943 - Flags: review?(armenzg) → review+

Armen [:armenzg]

Comment 10

•

12 years ago

Have these patches been tested? I would ike to see 4.0 and x86 running these before landing the 2.2 patches as it is more of a churn to deploy (reconfigs).

Geoff Brown [:gbrown]

Assignee

Comment 11

•

12 years ago

Not really. I thought I would check in the mozharness patch and immediately kick off a few ash jobs to check. Reasonable?

Armen [:armenzg]

Comment 12

•

12 years ago

Yep!

Geoff Brown [:gbrown]

Assignee

Comment 13

•

12 years ago

https://hg.mozilla.org/build/mozharness/rev/7961de705a7e

Geoff Brown [:gbrown]

Assignee

Comment 14

•

12 years ago

Verified no apparent harm on Android 4.0 and Android 4.2 x86 on Cedar. For example https://tbpl.mozilla.org/php/getParsedLog.php?id=32783016&tree=Cedar&full=1.

Armen [:armenzg]

Comment 15

•

12 years ago

Should we land the 2.2 code? We could a reconfig early next week.

Geoff Brown [:gbrown]

Assignee

Comment 16

•

12 years ago

I think so -- it should be safe.

Geoff Brown [:gbrown]

Assignee

Comment 17

•

12 years ago

https://hg.mozilla.org/build/buildbotcustom/rev/daec665bb52f

Rail Aliiev [:rail]

Comment 18

•

12 years ago

In production

Geoff Brown [:gbrown]

Assignee

Comment 19

•

12 years ago

Looks like this is working: https://tbpl.mozilla.org/php/getParsedLog.php?id=32940762&tree=Mozilla-Inbound

Status: NEW → RESOLVED

Closed: 12 years ago

Resolution: --- → FIXED

You need to log in before you can comment on or make changes to this bug.