Closed Bug 686245 Opened 13 years ago Closed 10 years ago

Intermittent Android "talosError: 'browser non-zero return code (1) [browser_output.txt]'"

Categories

(Testing :: Talos, defect)

Platform: ARM, Android
Type: defect
Priority: Not set
Severity: normal

Tracking

(firefox16 affected, firefox17 affected, firefox18 affected, firefox19 affected)

RESOLVED WORKSFORME

People

(Reporter: philor, Unassigned)

References

Details

(Keywords: intermittent-failure, Whiteboard: [for FF19 on Beta, see comment 1441][mobile_unittests][mobile_dev_needed])

Attachments

(1 file)

Talos is of the opinion that it's your fault you are returning 1, and you should be more explicit about what your problem was. Of course, you have no idea why you are, or how to reproduce it, but that's your problem, now isn't it?

https://tbpl.mozilla.org/php/getParsedLog.php?id=6372769
Android Tegra 250 mozilla-central talos remote-tpan on 2011-09-11 16:35:29 PDT for push 569a45bfb71c

RETURN:s: tegra-043
RETURN:id:20110911161228
RETURN:<a href = "http://hg.mozilla.org/mozilla-central/rev/569a45bfb71c">rev:569a45bfb71c</a>
tegra-043: 
		Started Sun, 11 Sep 2011 16:45:38
Running test tpan: 
		Started Sun, 11 Sep 2011 16:45:38
reconnecting socket
FIRE PROC: 'org.mozilla.fennec  -profile /mnt/sdcard/tests/profile http://bm-remote.build.mozilla.org/getInfo.html'
reconnecting socket
FIRE PROC: 'org.mozilla.fennec  -profile /mnt/sdcard/tests/profile http://bm-remote.build.mozilla.org/startup_test/fennecmark/fennecmark.html?test=PanDown%26webServer=bm-remote.build.mozilla.org'
reconnecting socket
pushing directory: /tmp/tmpAZX6ip/profile to /mnt/sdcard/tests/profile
	Screen width/height:1024/768
	colorDepth:24
	Browser inner width/height: 980/821

NOISE: __start_report137__end_report
NOISE: __startTimestamp1315759734736__endTimestamp
reconnecting socket
FIRE PROC: 'org.mozilla.fennec  -profile /mnt/sdcard/tests/profile http://bm-remote.build.mozilla.org/startup_test/fennecmark/fennecmark.html?test=PanDown%26webServer=bm-remote.build.mozilla.org'
NOISE: __start_report93__end_report
NOISE: __startTimestamp1315759916663__endTimestamp
reconnecting socket
FIRE PROC: 'org.mozilla.fennec  -profile /mnt/sdcard/tests/profile http://bm-remote.build.mozilla.org/startup_test/fennecmark/fennecmark.html?test=PanDown%26webServer=bm-remote.build.mozilla.org'
NOISE: __start_report99__end_report
NOISE: __startTimestamp1315760099158__endTimestamp
reconnecting socket
unable to connect socket
reconnecting socket
unable to connect socket
reconnecting socket
unable to connect socket
reconnecting socket
unable to connect socket
reconnecting socket
unable to connect socket
reconnecting socket
unable to connect socket
reconnecting socket
unable to connect socket
reconnecting socket
unable to connect socket
reconnecting socket
unable to connect socket
reconnecting socket
unable to connect socket
reconnecting socket
unable to connect socket
NOISE: 
NOISE: __FAILbrowser non-zero return code (1)__FAIL
reconnecting socket
unable to connect socket
reconnecting socket
unable to connect socket
reconnecting socket
unable to connect socket
reconnecting socket
unable to connect socket
reconnecting socket
unable to connect socket
reconnecting socket
unable to connect socket
Failed tpan: 
		Stopped Sun, 11 Sep 2011 16:59:31
FAIL: Busted: tpan
FAIL: browser non-zero return code (1)
Completed test tpan: 
		Stopped Sun, 11 Sep 2011 16:59:31
This is a clear sign the tegra is going away. The return code of '1' is not what the browser is returning, just which error handler in Talos caught the error. Somehow we are causing the device to go offline, most likely by starving it of memory or killing off necessary processes (Android does this when it is low on memory), and the network light goes off.
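One way to make that distinction visible is to look at what preceded the `__FAIL` marker in the log. The sketch below is only an illustration, not Talos code: `classify_failure` and the `min_socket_errors` threshold are hypothetical, and the line formats are taken from the log excerpt above. It labels a failure `device-offline` when the marker follows a run of `unable to connect socket` lines with no successful report in between.

```python
import re

# Failure marker as it appears in the Talos output above:
#   NOISE: __FAILbrowser non-zero return code (1)__FAIL
FAIL_RE = re.compile(r"__FAIL(.*?)__FAIL")


def classify_failure(log_text, min_socket_errors=3):
    """Return (message, likely_cause) for the first __FAIL marker, or None.

    likely_cause is "device-offline" when at least `min_socket_errors`
    "unable to connect socket" lines were seen since the last successful
    report, and "unknown" otherwise. The threshold is an arbitrary guess.
    """
    socket_errors = 0
    for line in log_text.splitlines():
        line = line.strip()
        if line == "unable to connect socket":
            socket_errors += 1
            continue
        match = FAIL_RE.search(line)
        if match:
            if socket_errors >= min_socket_errors:
                return match.group(1), "device-offline"
            return match.group(1), "unknown"
        if line.startswith("NOISE: __start_report"):
            # A report came back, so the device was reachable again.
            socket_errors = 0
    return None
```

Run against the tpan log above, this would report `device-offline`, which matches the reading that the return code says nothing about the browser itself.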
Whiteboard: [orange] → [orange][mobile_unittests][mobile_dev_needed]
https://tbpl.mozilla.org/php/getParsedLog.php?id=10259232&tree=Mozilla-Inbound
Whiteboard: [orange][mobile_unittests][mobile_dev_needed] → [orange][mobile_unittests][mobile_dev_needed][purple]
https://tbpl.mozilla.org/php/getParsedLog.php?id=11470740&tree=Mozilla-Inbound

Wonder how often robocheck2 actually manages to run.
https://tbpl.mozilla.org/php/getParsedLog.php?id=11504048&tree=Mozilla-Inbound

One in five or six times is how often robocheck2 runs, in case anyone else was curious.
I was curious about which tests are failing here, so reviewed Mozilla-Inbound logs reported during August:

tCheck:        7
tCheck2:       7
tCheck3:       7
tRoboProvider: 6
tPan:          5
Summary: Intermittent "browser non-zero return code (1)" running Talos on Android → Intermittent "browser non-zero return code (1) [browser_output.txt]" running Talos on Android
Depends on: 790602
https://tbpl.mozilla.org/php/getParsedLog.php?id=15381879&tree=Mozilla-Beta
tegra-232
Component: General → Talos
Product: Fennec → Testing
https://tbpl.mozilla.org/php/getParsedLog.php?id=15393087&tree=Cedar
tegra-253
Whiteboard: [orange][mobile_unittests][mobile_dev_needed][purple] → [orange][mobile_unittests][mobile_dev_needed][purple][no, red]
Summary: Intermittent "browser non-zero return code (1) [browser_output.txt]" running Talos on Android → Intermittent Android "talosError: 'browser non-zero return code (1) [browser_output.txt]'"
Depends on: 797324
https://tbpl.mozilla.org/php/getParsedLog.php?id=15844624&tree=Mozilla-Inbound
Since I have my mobile talos hat on today, here's a thought:

Maybe we could consider adding some instrumentation to record the logcat when running the browser, and trying to grab it only in the case that we failed? (i.e. the "return code" is non-zero). I know we resisted displaying the logcat because the tbpl error parser would falsely conclude that there were errors from parsing it, but that doesn't matter if we're failing anyway, does it?
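A rough sketch of that idea, to make it concrete. It assumes the device is reachable through `adb` (the tegras may well need a different transport), and `dump_logcat` and `logcat_if_failed` are hypothetical names, not existing Talos or mozdevice functions. `adb logcat -d` dumps the current buffer and exits.

```python
import subprocess


def dump_logcat(adb="adb"):
    """Dump the current logcat buffer and return it as text.

    Assumes an `adb` binary on PATH and a single attached device.
    """
    result = subprocess.run(
        [adb, "logcat", "-d"],
        capture_output=True,
        text=True,
        check=False,
    )
    return result.stdout


def logcat_if_failed(return_code, dump=dump_logcat):
    """Fetch logcat only when the browser run failed; return None on success."""
    if return_code == 0:
        return None
    return dump()
```

The `dump` parameter is only there so the logic can be exercised without a device. Writing the result to a separate file instead of the main log would be one way to keep it away from the tbpl error parser.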
https://tbpl.mozilla.org/php/getParsedLog.php?id=15857795&tree=Mozilla-Inbound
tegra-078

Yeah, false positives while failing do still hurt - especially if you have a failure which isn't parsed (getting better, but that used to be about 99% of Android failures), then people will wrongly conclude that the false positive is the problem, and since it's known, they'll ignore the problem they are causing on Try and just go ahead and push.
https://tbpl.mozilla.org/php/getParsedLog.php?id=15891924&tree=Firefox
tegra-080
https://tbpl.mozilla.org/php/getParsedLog.php?id=15891914&tree=Firefox
tegra-257

Something in that Friday night deploy made this a whole lot more frequent in robocheck/robopan - it's a rare push that doesn't hit it once or twice now.
https://tbpl.mozilla.org/php/getParsedLog.php?id=15892033&tree=Firefox
tegra-223

Or three per push, if you taunt happy fun return code.
https://tbpl.mozilla.org/php/getParsedLog.php?id=15893729&tree=Firefox
tegra-035

Six, but since it caught the nightly that doesn't exactly count.
Whiteboard: [orange][mobile_unittests][mobile_dev_needed][purple][no, red] → [mobile_unittests][mobile_dev_needed][purple][no, red]
Joel, comment 6 seems to imply this is more of a device/harness issue than a Fennec bug; could I leave you as point to push this toporange to completion? :-)
Flags: needinfo?(jmaher)
this bug is actually a crash during startup as seen by the logcat dump in the full log.  After looking at 10 of the more recent logs in this bug, it is all java crash related.  

Redirecting to blassey for next steps on the mobile developer front to handle startup crashes.
Flags: needinfo?(jmaher) → needinfo?(blassey.bugs)
(In reply to Joel Maher (:jmaher) from comment #1279)
> this bug is actually a crash during startup as seen by the logcat dump in
> the full log.  After looking at 10 of the more recent logs in this bug, it
> is all java crash related.  
> 
> Redirecting to blassey for next steps on the mobile developer front to
> handle startup crashes.

Joel, can you create a new bug with the crash stack? This one is useless with its 1280 comments.
Flags: needinfo?(blassey.bugs)
Depends on: 825643
Depends on: 829588
Comment 1298 is a Java exception on m-i: 

01-09 10:25:59.103 I/TestRunner( 2555): java.lang.NullPointerException
01-09 10:25:59.103 I/TestRunner( 2555): 	at org.mozilla.fennec.tests.BaseTest.setUp(BaseTest.java:82)
01-09 10:25:59.103 I/TestRunner( 2555): 	at org.mozilla.fennec.tests.ContentProviderTest.setUp(ContentProviderTest.java:220)
01-09 10:25:59.103 I/TestRunner( 2555): 	at org.mozilla.fennec.tests.testBrowserProviderPerf.setUp(testBrowserProviderPerf.java:256)
01-09 10:25:59.103 I/TestRunner( 2555): 	at junit.framework.TestCase.runBare(TestCase.java:125)
01-09 10:25:59.103 I/TestRunner( 2555): 	at junit.framework.TestResult$1.protect(TestResult.java:106)
01-09 10:25:59.103 I/TestRunner( 2555): 	at junit.framework.TestResult.runProtected(TestResult.java:124)
01-09 10:25:59.103 I/TestRunner( 2555): 	at junit.framework.TestResult.run(TestResult.java:109)
01-09 10:25:59.103 I/TestRunner( 2555): 	at junit.framework.TestCase.run(TestCase.java:118)
01-09 10:25:59.103 I/TestRunner( 2555): 	at android.test.AndroidTestRunner.runTest(AndroidTestRunner.java:169)
01-09 10:25:59.103 I/TestRunner( 2555): 	at android.test.AndroidTestRunner.runTest(AndroidTestRunner.java:154)
01-09 10:25:59.103 I/TestRunner( 2555): 	at android.test.InstrumentationTestRunner.onStart(InstrumentationTestRunner.java:520)
01-09 10:25:59.103 I/TestRunner( 2555): 	at android.app.Instrumentation$InstrumentationThread.run(Instrumentation.java:1447)
Comment 1297 is a Java exception - bug 830557.

Comment 1293, 1292, and 1291 are classic examples of unexplained-crash-on-startup, on m-i and mozilla-release.

The other 40+ logs since 1293 are all on mozilla-beta.

Have the unexplained crashes been resolved on all branches but mozilla-beta, around January 7?
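Sorting logs into "Java exception" versus "unexplained crash" by hand gets tedious, so here is a small sketch of how that triage might be scripted. It is an illustration only: `first_exception` is a hypothetical helper, and the regular expressions assume logcat lines shaped exactly like the `I/TestRunner` lines quoted above. It returns the first exception class and the topmost stack frame that follows it.

```python
import re

# Matches lines such as:
#   01-09 10:25:59.103 I/TestRunner( 2555): java.lang.NullPointerException
LOGCAT_RE = re.compile(r"^\d\d-\d\d \d\d:\d\d:\d\d\.\d+ \w/(\w+)\(\s*\d+\): (.*)$")
EXCEPTION_RE = re.compile(r"^([\w.$]+(?:Exception|Error))(?::.*)?$")
FRAME_RE = re.compile(r"^\s*at ([\w.$]+)\(([\w.]+):(\d+)\)$")


def first_exception(logcat_lines, tag="TestRunner"):
    """Return (exception, method, file, line) for the first exception, or None."""
    exception = None
    for raw in logcat_lines:
        entry = LOGCAT_RE.match(raw.rstrip("\n"))
        if not entry or entry.group(1) != tag:
            continue
        message = entry.group(2)
        if exception is None:
            # Still looking for the line that names the exception class.
            found = EXCEPTION_RE.match(message.strip())
            if found:
                exception = found.group(1)
            continue
        frame = FRAME_RE.match(message)
        if frame:
            return exception, frame.group(1), frame.group(2), int(frame.group(3))
    return None
```

A log where this returns `None` would fall into the unexplained-crash bucket and still need a human to look at it.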
Depends on: 830557
Oh we might be in the clear.  I added code in mozdevice (which went to talos) yesterday which solves this problem from the harness side (specifically the mozilla-beta issues).
The last 11 failures (and many of the older ones) reported here have been "mozilla-beta talos remote-ts". These logs are not nearly as useful as what we  usually get from mozilla-inbound/central/etc these days...and I'm not sure what needs uplifting.
(In reply to Geoff Brown [:gbrown] from comment #1342)
> Comment 1298 is a Java exception on m-i: 
> 
> 01-09 10:25:59.103 I/TestRunner( 2555): java.lang.NullPointerException
> 01-09 10:25:59.103 I/TestRunner( 2555): 	at
> org.mozilla.fennec.tests.BaseTest.setUp(BaseTest.java:82)
> 01-09 10:25:59.103 I/TestRunner( 2555): 	at
> org.mozilla.fennec.tests.ContentProviderTest.setUp(ContentProviderTest.java:
> 220)
> 01-09 10:25:59.103 I/TestRunner( 2555): 	at
> org.mozilla.fennec.tests.testBrowserProviderPerf.
> setUp(testBrowserProviderPerf.java:256)
> 01-09 10:25:59.103 I/TestRunner( 2555): 	at
> junit.framework.TestCase.runBare(TestCase.java:125)
> 01-09 10:25:59.103 I/TestRunner( 2555): 	at
> junit.framework.TestResult$1.protect(TestResult.java:106)
> 01-09 10:25:59.103 I/TestRunner( 2555): 	at
> junit.framework.TestResult.runProtected(TestResult.java:124)
> 01-09 10:25:59.103 I/TestRunner( 2555): 	at
> junit.framework.TestResult.run(TestResult.java:109)
> 01-09 10:25:59.103 I/TestRunner( 2555): 	at
> junit.framework.TestCase.run(TestCase.java:118)
> 01-09 10:25:59.103 I/TestRunner( 2555): 	at
> android.test.AndroidTestRunner.runTest(AndroidTestRunner.java:169)
> 01-09 10:25:59.103 I/TestRunner( 2555): 	at
> android.test.AndroidTestRunner.runTest(AndroidTestRunner.java:154)
> 01-09 10:25:59.103 I/TestRunner( 2555): 	at
> android.test.InstrumentationTestRunner.onStart(InstrumentationTestRunner.
> java:520)
> 01-09 10:25:59.103 I/TestRunner( 2555): 	at
> android.app.Instrumentation$InstrumentationThread.run(Instrumentation.java:
> 1447)

So, if I read that correctly, we have a null pointer exception on this line:
String rootPath = FennecInstrumentationTestRunner.getArguments().getString("deviceroot");

presumably that means that getArguments() is returning null. Let's try catching that and see what we get
(In reply to Geoff Brown [:gbrown] from comment #1431)
> The last 11 failures (and many of the older ones) reported here have been
> "mozilla-beta talos remote-ts". These logs are not nearly as useful as what
> we  usually get from mozilla-inbound/central/etc these days...and I'm not
> sure what needs uplifting.

I checked with :jmaher and he pointed me at bug 829588, which looks highly relevant!
We updated mozdevice on beta. The ts failures persist.

I forgot that talos has its own devicemanager: The change that might help here is updating talos.json / talos.zip.

Currently we have:
m-c: talos.8dd79e0f1d4c.zip
m-a: talos.6de61e505cf3.zip
m-b: talos.06465e978da7.zip

:jmaher -- should we update m-b to talos.6de61e505cf3.zip?
Flags: needinfo?(jmaher)
The only problem with updating talos on m-b is that I have a hacked-up custom version up there to account for some telemetry prefs whose values are incorrect on older versions of Firefox (<20). See https://bugzilla.mozilla.org/show_bug.cgi?id=828752.

We can discuss this today, but most likely we will need to update talos in another hacked custom way, or just wait until the next uplift when firefox-20 hits beta.
Flags: needinfo?(jmaher)
(In reply to Joel Maher (:jmaher) from comment #1441)
> or just wait until the next uplift when
> firefox-20 hits beta.

Let's do that: ride the trains (expect frequent ts failures on beta until then).
Whiteboard: [mobile_unittests][mobile_dev_needed][purple][no, red] → [mobile_unittests][mobile_dev_needed][purple][no, red][for FF19 on Beta, see comment 1441]
(In reply to Geoff Brown [:gbrown] from comment #1442)
> (In reply to Joel Maher (:jmaher) from comment #1441)
> > or just wait until the next uplift when
> > firefox-20 hits beta.
> 
> Let's do that: ride the trains (expect frequent ts failures on beta until
> then).

Those last six? Those would be 20 on beta, 0 for 6 so far.
Whiteboard: [mobile_unittests][mobile_dev_needed][purple][no, red][for FF19 on Beta, see comment 1441] → [for FF19 on Beta, see comment 1441][mobile_unittests][mobile_dev_needed]
This would use the talos from mozilla-central; this should already be fixed on aurora as well.
Attachment #717889 - Flags: review?(ryanvm) → review+
Closing bugs where TBPLbot has previously commented, but have now not been modified for >3 months & do not contain the whiteboard strings for disabled/annotated tests or use the keyword leave-open. Filter on: mass-intermittent-bug-closure-2014-07
Status: NEW → RESOLVED
Closed: 10 years ago
Resolution: --- → WORKSFORME