Closed Bug 790102 Opened 12 years ago Closed 11 years ago

Frequent trobocheck, trobopan "talosError: 'GeckoEventExpecter: blockForEvent timeout: Gecko:Ready [browser_output.txt]'"

Categories

(Testing :: Talos, defect)

ARM
Android
defect
Not set
normal

Tracking

(Not tracked)

RESOLVED FIXED
mozilla19

People

(Reporter: RyanVM, Assigned: gbrown)

References

Details

(Keywords: intermittent-failure)

Attachments

(2 files, 2 obsolete files)

https://tbpl.mozilla.org/php/getParsedLog.php?id=15109869&tree=Mozilla-Inbound

Android Tegra 250 mozilla-inbound talos remote-trobocheck3 on 2012-09-10 13:44:15 PDT for push cf39151876dc
slave: tegra-076

getting files in '/mnt/sdcard/tests/profile/minidumps/'
Failed tcheck3: 
		Stopped Mon, 10 Sep 2012 13:51:07
Traceback (most recent call last):
  File "run_tests.py", line 250, in run_tests
    talos_results.add(mytest.runTest(browser_config, test))
  File "/builds/tegra-076/talos-data/talos/ttest.py", line 378, in runTest
    test_results.add(browser_log_filename, counter_results=counter_results)
  File "/builds/tegra-076/talos-data/talos/results.py", line 119, in add
    results = BrowserLogResults(filename=results, counter_results=counter_results, global_counters=self.global_counters).results()
  File "/builds/tegra-076/talos-data/talos/results.py", line 309, in __init__
    self.error(match.group(1))
  File "/builds/tegra-076/talos-data/talos/results.py", line 324, in error
    raise utils.talosError(message)
talosError: 'GeckoEventExpecter: blockForEvent timeout: Gecko:Ready [browser_output.txt]'
Traceback (most recent call last):
  File "run_tests.py", line 298, in <module>
FAIL: Busted: tcheck3
FAIL: GeckoEventExpecter: blockForEvent timeout: Gecko:Ready [browser_output.txt]
    main()
  File "run_tests.py", line 295, in main
    run_tests(parser)
  File "run_tests.py", line 259, in run_tests
    raise e
utils.talosError: 'GeckoEventExpecter: blockForEvent timeout: Gecko:Ready [browser_output.txt]'
program finished with exit code 1
elapsedTime=206.847420
TinderboxPrint:<a href = "http://hg.mozilla.org/integration/mozilla-inbound/rev/cf39151876dc">rev:cf39151876dc</a>

TinderboxPrint:FAIL: Busted: tcheck3

TinderboxPrint:FAIL: GeckoEventExpecter: blockForEvent timeout: Gecko:Ready [browser_output.txt]
I see 2 things going wrong here, for reasons unknown:
 - the Gecko:Ready event is not received before the timeout, but there is no obvious hang; 
 - setDrawListener throws an NPE here:

37     public void setDrawListener(GeckoLayerClient.DrawListener listener) {
38         mGeckoApp.getLayerView().getLayerClient().setDrawListener(listener);
39     }

I think either getLayerView() or getLayerClient() must have returned null.
https://tbpl.mozilla.org/php/getParsedLog.php?id=16140776&tree=Mozilla-Inbound
https://tbpl.mozilla.org/php/getParsedLog.php?id=16140868&tree=Mozilla-Inbound
Summary: Intermittent trobocheck3 FAIL: GeckoEventExpecter: blockForEvent timeout: Gecko: Ready [browser_output.txt] → Intermittent trobocheck "talosError: 'GeckoEventExpecter: blockForEvent timeout: Gecko:Ready [browser_output.txt]'"
This has suddenly picked up frequency, not sure why.

https://tbpl.mozilla.org/php/getParsedLog.php?id=16153993&tree=Mozilla-Inbound
I assume something landed in the last day or so which caused this problems.  I saw this on a try server run as well, was sort of odd.
https://tbpl.mozilla.org/php/getParsedLog.php?id=16174261&tree=Mozilla-Inbound
https://tbpl.mozilla.org/php/getParsedLog.php?id=16174181&tree=Mozilla-Inbound
https://tbpl.mozilla.org/php/getParsedLog.php?id=16174216&tree=Mozilla-Inbound
Summary: Intermittent trobocheck "talosError: 'GeckoEventExpecter: blockForEvent timeout: Gecko:Ready [browser_output.txt]'" → Intermittent trobocheck, trobopan "talosError: 'GeckoEventExpecter: blockForEvent timeout: Gecko:Ready [browser_output.txt]'"
https://tbpl.mozilla.org/php/getParsedLog.php?id=16178747&tree=Mozilla-Inbound
https://tbpl.mozilla.org/php/getParsedLog.php?id=16178709&tree=Mozilla-Inbound
https://tbpl.mozilla.org/php/getParsedLog.php?id=16178885&tree=Mozilla-Inbound
https://tbpl.mozilla.org/php/getParsedLog.php?id=16179041&tree=Mozilla-Inbound
https://tbpl.mozilla.org/php/getParsedLog.php?id=16179066&tree=Mozilla-Inbound
https://tbpl.mozilla.org/php/getParsedLog.php?id=16180313&tree=Mozilla-Inbound
https://tbpl.mozilla.org/php/getParsedLog.php?id=16180389&tree=Mozilla-Inbound
https://tbpl.mozilla.org/php/getParsedLog.php?id=16180410&tree=Mozilla-Inbound
https://tbpl.mozilla.org/php/getParsedLog.php?id=16181240&tree=Mozilla-Inbound
https://tbpl.mozilla.org/php/getParsedLog.php?id=16181247&tree=Mozilla-Inbound
https://tbpl.mozilla.org/php/getParsedLog.php?id=16181171&tree=Mozilla-Inbound
https://tbpl.mozilla.org/php/getParsedLog.php?id=16181480&tree=Mozilla-Inbound
https://tbpl.mozilla.org/php/getParsedLog.php?id=16182938&tree=Mozilla-Inbound
https://tbpl.mozilla.org/php/getParsedLog.php?id=16182816&tree=Mozilla-Inbound
https://tbpl.mozilla.org/php/getParsedLog.php?id=16182531&tree=Mozilla-Inbound
https://tbpl.mozilla.org/php/getParsedLog.php?id=16182639&tree=Mozilla-Inbound
https://tbpl.mozilla.org/php/getParsedLog.php?id=16183562&tree=Mozilla-Inbound
https://tbpl.mozilla.org/php/getParsedLog.php?id=16183627&tree=Mozilla-Inbound
https://tbpl.mozilla.org/php/getParsedLog.php?id=16183828&tree=Mozilla-Inbound
https://tbpl.mozilla.org/php/getParsedLog.php?id=16183816&tree=Mozilla-Inbound
https://tbpl.mozilla.org/php/getParsedLog.php?id=16183793&tree=Mozilla-Inbound
Some notes:
- this seems to have picked up significantly on inbound only, as of bug 800486 [1]. almost every pushed cset has one or more of rck, rck2, or rck3 failing
- even after bug 800486 was merged from inbound to central, there do not appear to have been any occurences of this on central.
- i re-triggered rck, rck2, and rck3 on the push immediately prior to bug 800486 and it happened there as well (although it looks like less frequently, only 1 of 9)
- i pushed a backout of bug 800486 to try [3] and it appears to still be happening (so far 1/3, 6 more pending)

[1] https://tbpl.mozilla.org/?rev=bdb3e55bddda&tree=Mozilla-Inbound
[2] https://tbpl.mozilla.org/?rev=a10052ea8e72&tree=Mozilla-Inbound
[3] https://tbpl.mozilla.org/?tree=Try&rev=0017b32d9d84
I also retriggered rck, rck2, and rck3 on an inbound push a few csets prior to bug 800486 [1] and it didn't happen at all (0 of 9). So it does look like something that was introduced by bug 800486, except that backing it out doesn't look like it helps and it's not happening on central. I don't understand...

[1] https://tbpl.mozilla.org/?rev=7b582c09463b&tree=Mozilla-Inbound
Further update: my backout try build still had 4/9 failures, so it definitely doesn't help (at leas by itself).
https://tbpl.mozilla.org/php/getParsedLog.php?id=16198148&tree=Firefox - so probably it did make it to mozilla-central on schedule, but the lack of runs there and coincidence made it seem like it didn't.
We should try to figure out what the problem is and fix it rather than just find stuff to back out.
Good point, it was a mistake on my part to hide them on mozilla-inbound, unhidden.

So, we've got failure so bad that for any platform other than poor hapless Android, I would have closed mozilla-inbound yesterday afternoon. Now what?
https://tbpl.mozilla.org/?fromchange=a8a4f26cbb78&tochange=bdb3e55bddda&tree=Mozilla-Inbound&jobname=robo (and the stuff below, too) is making me start to think we might have broken it three times that day: once that was fixed by a backout, then again but only slightly, thus the one or less per push rate, then nearly completely.
I'm sick of pasting logs, so just assume it happens for 90% or more of the runs that happen.
(In reply to Geoff Brown [:gbrown] from comment #1)
> I see 2 things going wrong here, for reasons unknown:
>  - the Gecko:Ready event is not received before the timeout, but there is no
> obvious hang; 
>  - setDrawListener throws an NPE here:
> 
> 37     public void setDrawListener(GeckoLayerClient.DrawListener listener) {
> 38        
> mGeckoApp.getLayerView().getLayerClient().setDrawListener(listener);
> 39     }
> 
> I think either getLayerView() or getLayerClient() must have returned null.

Geoff, can we add some instrumentation to the Java code to try to narrow down what's going on?
(In reply to Robert O'Callahan (:roc) (Mozilla Corporation) from comment #35)
> Geoff, can we add some instrumentation to the Java code to try to narrow
> down what's going on?

Yes -- I have a try run building now.

BTW, I no longer see that NPE -- the logs for the last couple of days look different.
My first attempts to add logging failed on try: I reproduced the failure, but the logging was lost because subsequent operations wrote too much to logcat (we only collect about the last 500 lines of logcat). eg https://tbpl.mozilla.org/?tree=Try&rev=8200248f1eda

I tried removing code after the failure (page load, etc), but then could not reproduce the failure! eg. https://tbpl.mozilla.org/?tree=Try&rev=1d44f845b425

Next up: Detect the failure and exit immediately, so that relevant logcat is preserved on try and/or reproduce locally. I will work on this today.
We noticed that these failures started with the recent talos update (first on m-i, then on m-c). One of the changes was to pass crashreporter env variables when starting robocop: 

FIRE PROC: ' "MOZ_CRASHREPORTER_SHUTDOWN=1,MOZ_CRASHREPORTER_NO_REPORT=1,NO_EM_RESTART=1" am instrument -w -e deviceroot /mnt/sdcard/tests -e class org.mozilla.fennec.tests.testCheck org.mozilla.roboexample.test/org.mozilla.fennec.FennecInstrumentationTestRunner'


In local testing, :jmaher confirms that the failures appear to be resolved if the crashreporter env is not passed.
(In reply to Geoff Brown [:gbrown] from comment #38)
> In local testing, :jmaher confirms that the failures appear to be resolved
> if the crashreporter env is not passed.

...oops, maybe not - we are double checking on that.
Assignee: nobody → gbrown
Blocks: 803408
The three robochecks and robopan are now hidden on all trunk trees.
(In reply to Geoff Brown [:gbrown] from comment #39)
> (In reply to Geoff Brown [:gbrown] from comment #38)
> > In local testing, :jmaher confirms that the failures appear to be resolved
> > if the crashreporter env is not passed.
> 
> ...oops, maybe not - we are double checking on that.

In my tests, the failures continue when the crashreporter env is not passed: https://tbpl.mozilla.org/?tree=Try&rev=2697523b3f35&noignore=1
therefore we are screwed!
I finally managed to get some reasonable logging for the failure. 

The issue is that the Gecko:Ready event is missed: it is sent before robocop starts waiting for it. We launch Fennec from the test setUp(), then initialize various Robocop objects, then start running the test which waits for Gecko:Ready. It looks to me like Gecko:Ready is being sent much sooner than it was, say, a few months ago. Also, the robocop initialization likely takes longer: we read more parameters from our configuration file now than we used to.

It is difficult to move the wait for Gecko:Ready much earlier than it already is. It can be moved to the end of setUp, but that does not gain much time. Before that, the object that we use to wait for events has not been created yet...but I am trying to re-arrange the initialization to get around that.
If these efforts fail, I have 2 other ideas:
 - save the Gecko:Ready indication in Fennec; before waiting for Gecko:Ready, query Fennec to see if Gecko:Ready has already been sent
 - allow/expect the wait for Gecko:Ready to fail and continue, perhaps with a warning
Attached patch patch for helper function (obsolete) — — Splinter Review
Attachment #675244 - Flags: review?(gbrown)
Comment on attachment 675244 [details] [diff] [review]
patch for helper function

Review of attachment 675244 [details] [diff] [review]:
-----------------------------------------------------------------

::: mobile/android/base/tests/BaseTest.java.in
@@ +40,5 @@
>      private String mLogFile;
>      protected String mProfile;
>  
> +    protected void blockForGeckoReady() {
> +      Class appsCls = classLoader.loadClass("org.mozilla.gecko.GeckoApp");

Compile error here: classLoader not defined

@@ +47,5 @@
> +      Object states[] =  launchStateCls.getEnumConstants();
> +      Boolean ret = (Boolean)checkLaunchState.invoke(null, states[3]);
> +      if (!ret.booleanValue()) {
> +	mActions.expectGeckoEvent("Gecko:Ready").blockForEvent();
> +      }

To be absolutely sure, how about:

      expecter = mActions.expectGeckoEvent("Gecko:Ready");
      Boolean ret = (Boolean)checkLaunchState.invoke(null, states[3]);
      if (!ret.booleanValue()) {
	expecter.blockForEvent();
      }

instead?
Attached patch patch to use helper function — — Splinter Review
Attachment #675246 - Flags: review?(gbrown)
Attached patch patch for helper function (obsolete) — — Splinter Review
Attachment #675244 - Attachment is obsolete: true
Attachment #675244 - Flags: review?(gbrown)
Attachment #675249 - Flags: review?(gbrown)
Attachment #675249 - Attachment is patch: true
Attachment #675246 - Flags: review?(gbrown) → review+
Comment on attachment 675249 [details] [diff] [review]
patch for helper function

Review of attachment 675249 [details] [diff] [review]:
-----------------------------------------------------------------

Still doesn't compile!

...plus uncaught exceptions. I'll post an update....

::: mobile/android/base/tests/BaseTest.java.in
@@ +40,5 @@
>      private String mLogFile;
>      protected String mProfile;
>  
> +    protected void blockForGeckoReady() {
> +      Actions.EventExpecter geckoReadyExpector = mActions.expectGeckoEvent("Gecko:Ready")

Missing ;!

@@ +44,5 @@
> +      Actions.EventExpecter geckoReadyExpector = mActions.expectGeckoEvent("Gecko:Ready")
> +      ClassLoader classLoader = getActivity().getClassLoader();
> +      Class appsCls = classLoader.loadClass("org.mozilla.gecko.GeckoApp");
> +      Class launchStateCls = classLoader.loadClass("org.mozilla.gecko.GeckoApp$LaunchState");
> +      Method checkLaunchState =  appsCls.getMethod("checkLaunchState", launchStateCls);

You need to:
import java.lang.reflect.Method;
Attachment #675249 - Flags: review?(gbrown) → review-
Attached patch patch for helper function — — Splinter Review
This compiles!
Attachment #675249 - Attachment is obsolete: true
Attachment #675252 - Flags: review?(blassey.bugs)
Attachment #675252 - Flags: review?(blassey.bugs) → review+
Local test runs succeed. There's also a pending try run here: https://tbpl.mozilla.org/?tree=Try&rev=9bf252b0a454
https://hg.mozilla.org/mozilla-central/rev/13455917d72c
https://hg.mozilla.org/mozilla-central/rev/a63eeede2d15
Status: NEW → RESOLVED
Closed: 12 years ago
Flags: in-testsuite+
Resolution: --- → FIXED
Target Milestone: --- → mozilla19
Whiteboard: [orange]
Status: RESOLVED → REOPENED
Resolution: FIXED → ---
Depends on: 829711
8 / 8 failures on tegra-115
Depends on: tegra-115
Sure, that's vaguely suggestive of the chance of a problem with that tegra... :)
Status: REOPENED → RESOLVED
Closed: 12 years ago11 years ago
Resolution: --- → FIXED
Oh look, it's tegra-115 again.
Oh, look, I filed bug 832052 just for tegra-115, too.
You need to log in before you can comment on or make changes to this bug.

Attachment

General

Created:
Updated:
Size: