Closed Bug 1105388 Opened 10 years ago Closed 10 years ago

Crashes during Robocop shutdown

Categories

(Firefox for Android Graveyard :: Testing, defect)

All
Android
defect
Not set
normal

Tracking

(firefox35 fixed, firefox36 fixed, firefox37 fixed, firefox38 fixed, firefox-esr31 fixed)

RESOLVED FIXED
Firefox 38
Tracking Status
firefox35 --- fixed
firefox36 --- fixed
firefox37 --- fixed
firefox38 --- fixed
firefox-esr31 --- fixed

People

(Reporter: gbrown, Assigned: gbrown)

References

Details

Attachments

(1 file)

Many of our current Robocop crashes happen at shutdown time. 

Logs show:

 I/GeckoDump(  499): Robocop:Quit received -- requesting quit

which indicates that nsAppStartup::Quit() is being called. But 7000 ms later, shutdown is not complete, Robocop attempts to shutdown more aggressively, and a crash is usually reported, often with [@ libui.so + 0x1befe] in the stack.

I suspect a shutdown hang is the root cause. I thought bug 1020345 might be to blame, but it is not that.
(In reply to Geoff Brown [:gbrown] from comment #0)
> I suspect a shutdown hang is the root cause. 

Bug 1103833 might reveal something.
I wonder if a different method of exiting would help. System.exit() instead of finishOpenedActivities()? Some robotium examples use getActivity().finish() + solo.finishOpenedActivities().
Try push with extra shutdown logging: https://treeherder.mozilla.org/ui/#/jobs?repo=try&revision=8c0476c60b8d

Some observations:
 - On Android x86, following a robocop test, browser shutdown completes in about 400 ms, with little variation.
 - On Android 4.0, browser shutdown completes in about 1000 ms, with little variation.
 - On Android 2.3, the minimum shutdown time is about 4000 ms, but varies widely and can take more than 15000 ms (the maximum that this try patch waits for). I don't see evidence of a hang; it looks more like shutdown is proceeding very slowly.

On all platforms, there is sometimes a SIGSEGV following System.exit; no crash report is generated and the test passes. This is more common on 2.3.

On all platforms, this is often reported late in shutdown:
  ###!!! [Parent][OnMaybeDequeueOne] Error: Channel closing: too late to send/recv, messages will be lost

On 2.3, some runs have:
12-07 20:49:46.874 I/Gecko   ( 1003): *************************
12-07 20:49:46.874 I/Gecko   ( 1003): A coding exception was thrown and uncaught in a Task.
12-07 20:49:46.874 I/Gecko   ( 1003): 
12-07 20:49:46.874 I/Gecko   ( 1003): Full message: TypeError: tab.browser.contentWindow is undefined
12-07 20:49:46.885 I/Gecko   ( 1003): Full stack: Reader.parseDocumentFromTab<@chrome://browser/content/Reader.js:202:9
12-07 20:49:46.885 I/Gecko   ( 1003): TaskImpl_run@resource://gre/modules/Task.jsm:314:40
12-07 20:49:46.885 I/Gecko   ( 1003): Handler.prototype.process@resource://gre/modules/Promise.jsm -> resource://gre/modules/Promise-backend.js:870:23
12-07 20:49:46.885 I/Gecko   ( 1003): this.PromiseWalker.walkerLoop@resource://gre/modules/Promise.jsm -> resource://gre/modules/Promise-backend.js:749:7
12-07 20:49:46.885 I/Gecko   ( 1003): this.PromiseWalker.scheduleWalkerLoop/<@resource://gre/modules/Promise.jsm -> resource://gre/modules/Promise-backend.js:691:37
12-07 20:49:46.885 I/Gecko   ( 1003): 
12-07 20:49:46.885 I/Gecko   ( 1003): *************************
12-07 20:49:46.895 W/GeckoConsole( 1003): [JavaScript Error: "Error parsing document from tab: TypeError: tab.browser.contentWindow is undefined" {file: "chrome://browser/content/browser.js" line: 4250}]
I found a special-case: testFilterOpenTab. This test does not wait for Gecko:Ready or any other startup notification, and it completes quickly (generally less than 500 ms). As a result, the test may well complete before startup is complete. Consider these logcat extracts from Android 2.3:

https://treeherder.mozilla.org/ui/#/jobs?repo=try&revision=4fe68fe2daa1
http://mozilla-releng-blobs.s3.amazonaws.com/blobs/try/sha512/52d6393e6c0b23bf111e889f0e19735382fc2a2cd0f68094651062c156164ab5336b87aed32bb34d7b9ff50fc466bf7af0c24acc5c39fda484f706d6ed10c94a

12-08 13:56:49.559 I/ActivityManager(   64): Start proc org.mozilla.fennec for added application org.mozilla.fennec: pid=2073 uid=10017 gids={3003, 1015, 1006}
...
12-08 13:57:07.789 I/Robocop ( 2073): {"message":"finished in 539ms","time":1418075827785,"source":"robocop","status":"OK","test":"testFilterOpenTab - TestInsertLocalTabs","thread":null,"action":"test_end","pid":null}
12-08 13:57:07.789 I/Robocop ( 2073): {"action":"log","message":"TEST-START | Shutdown","time":1418075827791,"pid":null,"level":"info","source":"robocop","thread":null}
12-08 13:57:07.810 I/Robocop ( 2073): {"action":"log","message":"Passed: 2","time":1418075827806,"pid":null,"level":"info","source":"robocop","thread":null}
12-08 13:57:07.819 I/Robocop ( 2073): {"action":"log","message":"Failed: 0","time":1418075827817,"pid":null,"level":"info","source":"robocop","thread":null}
12-08 13:57:07.829 D/GeckoToolbar( 2073): onTabChanged: MENU_UPDATED
12-08 13:57:07.829 D/GeckoBrowserApp( 2073): BrowserApp.onTabChanged: 6: MENU_UPDATED
12-08 13:57:07.839 I/Robocop ( 2073): {"action":"log","message":"Todo: 0","time":1418075827824,"pid":null,"level":"info","source":"robocop","thread":null}
12-08 13:57:07.863 I/Robocop ( 2073): {"action":"log","message":"SimpleTest FINISHED","time":1418075827841,"pid":null,"level":"info","source":"robocop","thread":null}
12-08 13:57:07.890 I/GYB     ( 2073): endTest() done
12-08 13:57:07.890 I/GYB     ( 2073): sent Robocop:Quit
...
12-08 13:57:11.418 E/GeckoLibLoad( 2073): Loaded libs in 18251ms total, 5770ms(11760ms) user, 1570ms(4280ms) system, 0(0) faults
12-08 13:57:11.439 W/GeckoThread( 2073): zerdatime 2619680 - runGecko
...
12-08 13:57:26.999 D/GeckoBrowser( 2073): zerdatime 1418075846997 - browser chrome startup finished.
12-08 13:57:29.619 D/GeckoLayerClient( 2073): Screen-size changed to (1024,816)
12-08 13:57:29.619 D/GeckoLayerClient( 2073): Window-size changed to (1024,743)
12-08 13:57:29.669 I/Gecko   ( 2073): Attempting load of libEGL.so
...
12-08 13:57:34.009 I/GeckoDump( 2073): Robocop:Quit received -- requesting quit
12-08 13:57:34.050 I/Gecko   ( 2073): nsAppStartup::Quit
...
12-08 13:57:39.709 I/Gecko   ( 2073): ShutdownXPCOM...
...
12-08 13:57:41.223 D/GeckoHealthRec( 2073): Recording session end: P
12-08 13:57:41.278 D/GeckoAppShell( 2073): Killing via System.exit()
OS: Linux → Android
Hardware: x86_64 → All
Depends on: 1113751
Bug 1113751 is interesting and bug 1090927 has a similar issue: When a robocop test has more than one public testXXX function, the browser is closed at the end of the first test function, but the framework then tries to start up a new test for the next test function. Sometimes the shutdown happens cleanly, or sufficiently cleanly to avoid a test failure; sometimes there is a crash.

Private functions should not be a problem, but logs in bug 1090927 suggest they might be.

My sense is that these are isolated cases and this is not the issue causing most robo shutdown crashes, but it is responsible for some. Maybe a review of all robocop tests is in order?
Android 2.3 mochitest logs typically show shutdown in 5000 - 7000 ms, but sometimes are much longer:

http://ftp.mozilla.org/pub/mozilla.org/mobile/nightly/2015/01/2015-01-12-12-57-06-mozilla-central-android-api-9/mozilla-central_ubuntu64_vm_mobile_test-mochitest-10-bm115-tests1-linux64-build6.txt.gz

16:19:53     INFO -  01-12 16:16:44.093 I/GeckoDump(  589): ⰲ겿{"action":"log","time":1421108204096,"thread":"","js_source":"TestRunner","pid":null,"source":"mochitest","level":"INFO","message":"SimpleTest FINISHED"}ⰲ겿
16:19:53     INFO -  01-12 16:16:46.467 W/AudioFlinger(   37): write blocked for 71 msecs, 1673 delayed writes, thread 0xc658
16:19:53     INFO -  01-12 16:16:52.953 W/AudioFlinger(   37): write blocked for 75 msecs, 1676 delayed writes, thread 0xc658
16:19:53     INFO -  01-12 16:17:18.053 W/AudioFlinger(   37): write blocked for 78 msecs, 1678 delayed writes, thread 0xc658
16:19:53     INFO -  01-12 16:17:25.193 W/AudioFlinger(   37): write blocked for 69 msecs, 1679 delayed writes, thread 0xc658
16:19:53     INFO -  01-12 16:17:31.183 I/Gecko   (  589): ###!!! [Parent][OnMaybeDequeueOne] Error: Channel closing: too late to send/recv, messages will be lost
16:19:53     INFO -  01-12 16:17:31.193 W/AudioFlinger(   37): write blocked for 73 msecs, 1680 delayed writes, thread 0xc658
16:19:53     INFO -  01-12 16:17:37.963 D/GeckoIdleService(  589): Remove observer 0x54d24f90 (300 seconds), 0 remain idle
16:19:53     INFO -  01-12 16:17:38.254 W/AudioFlinger(   37): write blocked for 72 msecs, 1681 delayed writes, thread 0xc658
16:19:53     INFO -  01-12 16:17:46.188 W/AudioFlinger(   37): write blocked for 81 msecs, 1683 delayed writes, thread 0xc658
16:19:53     INFO -  01-12 16:18:10.323 W/AudioFlinger(   37): write blocked for 70 msecs, 1684 delayed writes, thread 0xc658
16:19:53     INFO -  01-12 16:18:11.384 I/Gecko   (  589): WARNING: At least one completion condition is taking too long to complete. Conditions: [{"name":"Telemetry: shutting down","state":"(none)","filename":"resource://gre/modules/TelemetryPing.jsm","lineNumber":1025,"stack":["resource://gre/modules/TelemetryPing.jsm:setupChromeProcess:1025","resource://gre/modules/TelemetryPing.jsm:Impl.observe:1229","resource://gre/modules/TelemetryPing.jsm:this.TelemetryPing<.observe:250","jar:jar:file:///data/app/org.mozilla.fennec-1.apk!/assets/omni.ja!/components/TelemetryStartup.js:TelemetryStartup.prototype.observe:24","null:null:0"]}] Barrier: profile-before-change2
16:19:53     INFO -  01-12 16:18:18.643 W/AudioFlinger(   37): write blocked for 72 msecs, 1685 delayed writes, thread 0xc658
16:19:53     INFO -  01-12 16:18:30.293 W/AudioFlinger(   37): write blocked for 76 msecs, 1687 delayed writes, thread 0xc658
16:19:53     INFO -  01-12 16:18:32.554 I/Gecko   (  589): ###!!! [Parent][OnMaybeDequeueOne] Error: Channel closing: too late to send/recv, messages will be lost
16:19:53     INFO -  01-12 16:18:33.103 D/GeckoAppShell(  589): Killing via System.exit()

...that's over 105000 ms!

Of course, these are different tests and might behave differently than robocop, but my point is that we are having trouble with shutdown on Android 2.3 for other test types -- we just don't notice it because the harness waits for much longer (probably 300 seconds).
After a test, robocop requests browser shutdown and sleeps for 7 seconds. Usually the process exits during the 7 seconds and all is well. If not, robocop forces the process down, and that sometimes works fine (exits cleanly) and sometimes results in a [@ libui.so + 0x1befe] shutdown crash. This patch simply increases the 7 seconds to 120 seconds.

I would prefer to find a "real" solution: 1) determine and address the root cause of shutdown slowness on the emulator, or 2) force an immediate shutdown. I still don't even know if the difference between a 5 second shutdown and a 60 second shutdown is in Firefox, the emulator, or the aws host -- and I sometimes suspect them all! My attempts to call System.exit() or similar without waiting for a clean shutdown have all resulted in intermittent crashes. I might be able to ignore shutdown crashes (http://hg.mozilla.org/mozilla-central/annotate/54cdfbf0f13a/toolkit/crashreporter/nsExceptionHandler.cpp#l1604?) but am not quite sure how to do that.

try runs verify that this change does no harm. I suspect it will avoid most shutdown crashes, but I think we need to see it landed to determine how effective it is.
Attachment #8548295 - Flags: review?(mark.finkle)
Comment on attachment 8548295 [details] [diff] [review]
increase robocop shutdown wait from 7 s to 120 s

Maybe we can audit the tests and see if we can lower the 120seconds to something more reasonable. I mean, if the tests all seem to be closing in ~30seconds, we can lower the timeout and hopefully find real issues faster.
Attachment #8548295 - Flags: review?(mark.finkle) → review+
I haven't seen any robocop shutdown crashes on mozilla-central since increasing the shutdown time here.

:RyanVM -- Do you agree? Do you want this on aurora/beta?
Flags: needinfo?(ryanvm)
(In reply to Mark Finkle (:mfinkle) from comment #9)
> Maybe we can audit the tests and see if we can lower the 120seconds to
> something more reasonable. I mean, if the tests all seem to be closing in
> ~30seconds, we can lower the timeout and hopefully find real issues faster.

I checked about 50 robocop logcats. At least 80% completed shutdown in less than 10 seconds. 4 took about 30 seconds. 1 took about 60 seconds. 

I think I want to keep the 120 second timeout, in hopes of 0% shutdown crashes.
I'd like to see this on any branch that runs robocop. Seems like there's basically no risk in that?
Flags: needinfo?(ryanvm)
This change seems very low risk: Very limited scope and no sign of trouble on central since landing on 2015-01-14.
Product: Firefox for Android → Firefox for Android Graveyard
You need to log in before you can comment on or make changes to this bug.

Attachment

General

Created:
Updated:
Size: