Closed Bug 806371 Opened 12 years ago Closed 12 years ago

Android talos test rpr (testBrowserProviderPerf) killed prematurely

Categories

(Firefox for Android Graveyard :: General, defect)

x86
Android
defect
Not set
normal

Tracking

(Not tracked)

RESOLVED WORKSFORME

People

(Reporter: gbrown, Assigned: gbrown)

References

Details

There are frequent test failures for Talos test rpr (testBrowserProviderPerf). Some are recorded in bug 796914. The test terminates abruptly, as though there is a crash, but no stack is reported. Tracing suggests the problem occurs in addTonsOfUrls -> bulkInsert -> getWritableDatabase. 10-29 06:38:43.457 D/GeckoBrowserProvider( 2440): Inserted ID in database: 475 10-29 06:38:43.457 V/GeckoBrowserProvider( 2440): Calling insert in transaction on URI: content://org.mozilla.fennec.db.browser/bookmarks?test=1 10-29 06:38:43.457 V/GeckoBrowserProvider( 2440): Insert on BOOKMARKS: content://org.mozilla.fennec.db.browser/bookmarks?test=1 10-29 06:38:43.457 V/GeckoBrowserProvider( 2440): Extracting image values for URI: 061a366c-66db-473e-abf7-685b47140e78 10-29 06:38:43.457 D/GeckoBrowserProvider( 2440): Inserting bookmark in database with URL: 061a366c-66db-473e-abf7-685b47140e78 10-29 06:38:43.457 V/GeckoBrowserProvider( 2440): Getting writable database for URI: content://org.mozilla.fennec.db.browser/bookmarks?test=1 10-29 06:38:43.477 I/ActivityManager( 1020): Process org.mozilla.fennec (pid 2440) has died. 10-29 06:38:43.477 I/WindowManager( 1020): WIN DEATH: Window{486cf2b0 org.mozilla.fennec/org.mozilla.fennec.App paused=false} 10-29 06:38:43.487 D/AndroidRuntime( 2432): Shutting down VM 10-29 06:38:43.487 I/Process ( 1020): Sending signal. PID: 2509 SIG: 9 10-29 06:38:43.487 I/WindowManager( 1020): WIN DEATH: Window{4867f2d0 SurfaceView paused=false} 10-29 06:38:43.487 W/ActivityManager( 1020): Crash of app org.mozilla.fennec running instrumentation ComponentInfo{org.mozilla.roboexample.test/org.mozilla.fennec.FennecInstrumentationTestRunner} 10-29 06:38:43.487 I/ActivityManager( 1020): Force stopping package org.mozilla.fennec uid=10033 10-29 06:38:43.487 I/ActivityManager( 1020): Force finishing activity HistoryRecord{4851f108 org.mozilla.fennec/.App} 10-29 06:38:43.487 W/ActivityManager( 1020): Duplicate finish request for HistoryRecord{4851f108 org.mozilla.fennec/.App}
Blocks: 796914
I see now! It's not so much a crash, but a kill: "Sending signal. PID: 2509 SIG: 9" I suspect talos is getting confused and killing Fennec mid-test. There is a timeout in ttest of 300 s; this test normally takes about 110 s. In the failure cases, it looks like we often only wait ~5 seconds, because it appears that the process has completed.
Summary: Crash during rpr (testBrowserProviderPerf) in getWritableDatabse → Android talos test rpr (testBrowserProviderPerf) killed prematurely
In the failure cases, ffprocess_remote.launchProcess only executes for ~5 s, whereas launchProcess executes for ~110 s in the success cases. In turn, this difference is seen in execution times for devicemanagerSUT.fireProcess. I suspect that sutAgent is confused by the "am" command line, but debugging the sutAgent is awkward in this case, since I can only seem to reproduce this bug on try. Also, it seems to me that sutAgent exec commands are not (and should not be) guaranteed to wait for process completion. There are safe-guards in devicemanager and ffprocess to wait for process completion after the process has been launched, but that code appears to be consistently failing -- I am looking into that now.
I determined that sutAgent was in fact waiting for process completion -- the process was ending abruptly because it was being killed. The kill was not originating from bcontroller/ffprocess/devicemanager -- they were all waiting for dm.fireProcess to complete. I noted that the kill was occurring while addTonsOfUrls was executing; I added sleeps before addTonsOfUrls, and the kill continued to occur during addTonsOfUrls. I suspected - not for the first time! - an OOM kill, but could not verify: onLowMemory was not called, but that's not conclusive one way or the other. I tried reducing the test's memory use by simply reducing the BATCH_SIZE in testBrowserProviderPerf.java.in. Immediately, try tests stopped showing the sudden kill on test startup. However, those same tests started failing intermittently with a SIGSEGV on test startup. We are currently seeing SIGSEGVs on startup in other tests and tracking that in other bugs, but not with this frequency. I then wondered if the SIGSEGVs were caused by an unrelated problem, so updated my build and pushed to try again: now I don't see the SIGSEGVs or the kills on try -- rpr is passing consistently. But now rpr is passing on tbpl also, without my change. Frustrating!!
The last 20 rpr runs on tbpl/m-i have been error-free. I guess there's nothing more to do here.
Status: NEW → RESOLVED
Closed: 12 years ago
Resolution: --- → WORKSFORME
Blocks: 816584
Blocks: 829371
Product: Firefox for Android → Firefox for Android Graveyard
You need to log in before you can comment on or make changes to this bug.