Closed Bug 991262 Opened 6 years ago Closed 6 years ago

Intermittent issue with blank homescreen during gaia-ui test

Categories

(Core Graveyard :: DOM: Apps, defect, major)

ARM
Gonk (Firefox OS)
defect
Not set
major

Tracking

(Not tracked)

RESOLVED WORKSFORME

People

(Reporter: bsilverberg, Unassigned)

References

Details

Attachments

(7 files)

Attached file Logcat of the run
This issue appears when running test_homescreen_delete_app.py multiple times. At some point (often the 2nd run), when the test starts the homescreen on the device remains blank, just showing the background. The test waits for quite some time and then fails with the following message(s):

TEST-START test_homescreen_delete_app.py
test_delete_app (test_homescreen_delete_app.TestDeleteApp) ...
ERROR

======================================================================
ERROR: None
----------------------------------------------------------------------
Traceback (most recent call last):
  File "/Users/bsilverberg/.virtualenvs/gaia/lib/python2.7/site-packages/marionette_client-0.7.5-py2.7.egg/marionette/marionette_test.py", line 145, in run
    self.setUp()
  File "/Users/bsilverberg/gitRepos/gaia/tests/python/gaia-ui-tests/gaiatest/tests/functional/homescreen/test_homescreen_delete_app.py", line 20, in setUp
    self.apps.switch_to_displayed_app()
  File "/Users/bsilverberg/gitRepos/gaia/tests/python/gaia-ui-tests/gaiatest/gaia_test.py", line 74, in switch_to_displayed_app
    self.marionette.switch_to_frame(self.displayed_app.frame)
  File "/Users/bsilverberg/.virtualenvs/gaia/lib/python2.7/site-packages/marionette_client-0.7.5-py2.7.egg/marionette/marionette.py", line 895, in switch_to_frame
    response = self._send_message('switchToFrame', 'ok', element=frame.id, focus=focus)
  File "/Users/bsilverberg/.virtualenvs/gaia/lib/python2.7/site-packages/marionette_client-0.7.5-py2.7.egg/marionette/marionette.py", line 604, in _send_message
    "Connection timed out", status=ErrorCodes.TIMEOUT)
TimeoutException: TimeoutException: Connection timed out


Most recent errors/exceptions are:

04-02 15:41:37.180 I/Gecko   ( 1267): 1396467697194	Marionette	INFO	sendToClient: {"from":"0","error":{"message":"Unable to locate element: div.appWindow.active","status":7,"stacktrace":null}}, {3b555fd2-7ddf-492d-b7b9-a7555065cec0}, {3b555fd2-7ddf-492d-b7b9-a7555065cec0}
04-02 15:41:37.340 I/Gecko   ( 1267): 1396467697356	Marionette	INFO	sendToClient: {"from":"0","error":{"message":"Unable to locate element: div.appWindow.active","status":7,"stacktrace":null}}, {1ff09116-47c9-4199-b41a-9627a6a111cd}, {1ff09116-47c9-4199-b41a-9627a6a111cd}
04-02 15:41:37.620 I/Gecko   ( 1267): 1396467697630	Marionette	INFO	sendToClient: {"from":"0","error":{"message":"Unable to locate element: div.appWindow.active","status":7,"stacktrace":null}}, {8c505fe1-5e63-46cb-8f35-3b083190d405}, {8c505fe1-5e63-46cb-8f35-3b083190d405}
04-02 15:41:38.580 I/Gecko   ( 1267): A coding exception was thrown and uncaught in a Task.
04-02 15:41:38.580 I/Gecko   ( 1267): Full message: TypeError: aIdsApp is undefined
TEST-UNEXPECTED-FAIL | test_homescreen_delete_app.py test_homescreen_delete_app.TestDeleteApp.test_delete_app | The B2G process has restarted after crashing during the tests so Marionette can't respond due to either a Gecko, Gaia or Marionette error. Above, the 5 most recent errors are listed. Check logcat for all errors if these errors are not the cause of the failure.
----------------------------------------------------------------------
Ran 1 test in 394.265s

FAILED (errors=1)

After this the homescreen is unresponsive to taps and flicks, but if you tap the home button a few times the homescreen appears, but with overly large icons.

See Stephen's video of this at https://www.youtube.com/watch?v=124hCO_EOpY&feature=youtu.be

I also reproduced this locally and am attaching the logcat from the run on my device which is a Hamachi, with the following build:

Gaia      6274b3645b3d41df4aa5a55398b5d106edf1c528
Gecko     https://hg.mozilla.org/mozilla-central/rev/4941a2ac0786
BuildID   20140402040201
Version   31.0a1
ro.build.version.incremental=eng.archermind.20131114.105818
ro.build.date=Thu Nov 14 10:58:33 CST 2013
Jonathan, would you mind having someone from your team investigate this from the (potential) Marionette side, and/or help us figure out where to go next?  Thx :-)
Flags: needinfo?(jgriffin)
Severity: normal → major
OS: Mac OS X → Gonk (Firefox OS)
Hardware: x86 → ARM
Attached file about-memory-0.tar.gz
this was obtained by running MOZ_IGNORE_NUWA_PROCESS=1 ./get_about_memory.py right before the bug occurred.
There's an error being triggered here:

04-02 15:41:38.580 I/Gecko   ( 1267): *************************
04-02 15:41:38.580 I/Gecko   ( 1267): A coding exception was thrown and uncaught in a Task.
04-02 15:41:38.580 I/Gecko   ( 1267):
04-02 15:41:38.580 I/Gecko   ( 1267): Full message: TypeError: aIdsApp is undefined
04-02 15:41:38.580 I/Gecko   ( 1267): Full stack: this.OperatorAppsRegistry._installOperatorApps/<@resource://gre/modules/OperatorApps.jsm:317:23
04-02 15:41:38.580 I/Gecko   ( 1267): TaskImpl_run@resource://gre/modules/Task.jsm:298:1
04-02 15:41:38.580 I/Gecko   ( 1267): resolve@resource://gre/modules/commonjs/sdk/core/promise.js:118:11
04-02 15:41:38.580 I/Gecko   ( 1267): then@resource://gre/modules/commonjs/sdk/core/promise.js:43:43
04-02 15:41:38.580 I/Gecko   ( 1267): resolve@resource://gre/modules/commonjs/sdk/core/promise.js:185:11
04-02 15:41:38.580 I/Gecko   ( 1267): TaskImpl_run@resource://gre/modules/Task.jsm:307:9
04-02 15:41:38.580 I/Gecko   ( 1267): Handler.prototype.process@resource://gre/modules/Promise.jsm -> resource://gre/modules/Promise-backend.js:748:11
04-02 15:41:38.580 I/Gecko   ( 1267): this.PromiseWalker.walkerLoop@resource://gre/modules/Promise.jsm -> resource://gre/modules/Promise-backend.js:627:7
04-02 15:41:38.580 I/Gecko   ( 1267):
04-02 15:41:38.580 I/Gecko   ( 1267): *************************

Moving this bug over to gecko.
Component: Gaia::UI Tests → DOM: Apps
Flags: needinfo?(jgriffin)
Product: Firefox OS → Core
Sounds like a dupe of bug 991246.
Status: NEW → RESOLVED
Closed: 6 years ago
Resolution: --- → DUPLICATE
Duplicate of bug: 991246
(In reply to Myk Melez [:myk] [@mykmelez] from comment #4)
> Sounds like a dupe of bug 991246.
> 
> *** This bug has been marked as a duplicate of bug 991246 ***

Although our failure has gwagner's filed bug in it, as well, he says it "shouldn't be a problem...always shows up during startup, but works."

I'm cautiously reopening so we can see, post-bug 991246 landing, if it still happens.
Status: RESOLVED → REOPENED
Resolution: DUPLICATE → ---
Bob/team, does this still happen?  I note there's also bug 992589 still out.
This does not reproduces for me. Tested on the latest Hamachi build:

Gaia      650e8c2c611ed07495d3bf3769f44a0efd88a492                         
Gecko     https://hg.mozilla.org/mozilla-central/rev/5811efc11011          
BuildID   20140408160203                                                   
Version   31.0a1
I just reproduced it on the second run of test_homescreen_delete_app with today's build:

Gaia      650e8c2c611ed07495d3bf3769f44a0efd88a492
Gecko     https://hg.mozilla.org/mozilla-central/rev/5811efc11011
BuildID   20140408160203
Version   31.0a1
ro.build.version.incremental=eng.archermind.20131114.105818
ro.build.date=Thu Nov 14 10:58:33 CST 2013
I saw it too on the same build from Comment 8.  TestCleanupGaia.test_cleanup_gaia fails in b2g.hamachi.mozilla-central.unittests fails because of this I believe.
This can also be reproduced when test_killall.py in the unittest folder is executed alone.  The test may complete successfully, but afterwards, the phone shows no icons.

Phone contained this version:
Gaia      553da99ab09b6b894d9f95bb06b16b6e1ddbf0a1
Gecko     https://hg.mozilla.org/mozilla-central/rev/5b6e82e7bbbf
BuildID   20140415040201
Version   31.0a1
ro.build.version.incremental=eng.tclxa.20131223.163538
ro.build.date=Mon Dec 23 16:36:04 CST 2013
Also, test_cleanup_gaia (unit test) fails (bug 997552) because of this bug.  When the script is executed , the homescreen hangs upon restart.
Blocks: 997552
Can you get a dmesg at the same point in the process that the b2g-info from comment 12 was taken?
Attached file adb shell b2g-info.txt
This is from a single, recent run of test_cleanup_gaia.py, directly after a branch-new flash.
Flags: needinfo?(bkelly)
Attached file adb shell dmesg.txt
This is the corresponding adb shell dmesg output of that recent, single run of test_cleanup_gaia.py, after a fresh flash of today's Hamachi m-c/master build.
I noticed that the end of the first test_homescreen_delete_app.py leaves the homescreen in the "move-or-kill app" mode.  The next test restart then fails to launch homescreen.

If you manually enter the "move-or-kill app" mode, then similarly the next test restart fails to launch homescreen.

I also say bug 998438 in "move-or-kill app" mode, but not sure if its related.
Ok, I can work around the problem by adding a time.sleep(5) here:

  https://github.com/mozilla-b2g/gaia/blob/master/tests/python/gaia-ui-tests/gaiatest/gaia_test.py#L722

It seems that perhaps the code to wait for the system to be ready is no longer adequate.  This could be related to the major window manager refactor in bug 907013.

Alive, what do you think of the code at the above link?  Is it still correct for trying to block tests until the system is ready?
Flags: needinfo?(bkelly) → needinfo?(alive)
See Also: → app-window-manager
See Also: → 990580
The 'appWindow.active' is the first opened app which should be FTU or homescreen.
I need to know what does 'ready' mean for test so I can't tell more here.
What does test really do after it thinks device is 'ready'?
Flags: needinfo?(alive)
Blocks: 987760
(In reply to Alive Kuo [:alive][NEEDINFO!][God bless Taiwan.] from comment #20)
> The 'appWindow.active' is the first opened app which should be FTU or
> homescreen.
> I need to know what does 'ready' mean for test so I can't tell more here.
> What does test really do after it thinks device is 'ready'?

9/10 the first thing a test does is launch an app so we need to wait until we can use the AppWindowManager safely to launch an app. but also it's good to wait for the RIL and other processes to have started up. This is much more complex than the js tests are dealing with because the device is slower and more hardware and diverse test cases. desktopb2g doesn't have these dependencies so it's a lot harder to deal with on the device.

I don't think bug 907013 is to blame because I only put the appWindow.active wait into gaiatest (linked to in comment#19) after that was merged by Alive.

This wait ( appWindow.active == displayed) used to only resolve when the FTU or homescreen was showing but now I see it resolving even when the Firefox OS animation is still present. The timing has definitely changed here recently, it could be several things.

I don't mind changing what we wait for here provided it is reliable. but we can't really wait for a Javascript event because there are too many spurious javascript errors in the same timeframe that Marionette catches instead. Which is why we are using appWindow.active, it's at the DOM level and not related to javascript errors.
No longer blocks: 987760
Have we tried the patch in comment 19 yet? I'm also wondering if we can get around this by changing the selector a bit? Are we positive that we are getting the right appWindow? We could also try something like:

locator = (By.CSS_SELECTOR, 'div.appWindow.active.homescreen iframe')
(In reply to Kevin Grandon :kgrandon from comment #22)
> Have we tried the patch in comment 19 yet? I'm also wondering if we can get
> around this by changing the selector a bit? Are we positive that we are
> getting the right appWindow? We could also try something like:
> 
> locator = (By.CSS_SELECTOR, 'div.appWindow.active.homescreen iframe')

:mdas and/or :njpark, can we get help with trying the patch, first, and then -- if needed -- the other suggestion(s) here?  Thanks!
Flags: needinfo?(npark)
(In reply to Kevin Grandon :kgrandon from comment #22)
> Have we tried the patch in comment 19 yet? I'm also wondering if we can get
> around this by changing the selector a bit? Are we positive that we are
> getting the right appWindow? We could also try something like:
> 
> locator = (By.CSS_SELECTOR, 'div.appWindow.active.homescreen iframe')

This will fail because the first app you get is actually the FTU. but it depends on the build (eg NOFTU=1) you're running too so we need to keep this flexible. Waiting for the first app, whether it be FTU or homescreen, has in the past been a pretty reliable way to know we're OK to start the setup.
One thing perhaps worth trying is to wait for the splash screen element to have cleared, then wait for the appWindow.active.render.

If you go adding sleeps please add them behind a device-only condition so we're not just adding unnecessary test duration to Travis and TBPL.
(In reply to Stephen Donner [:stephend] from comment #23)
> (In reply to Kevin Grandon :kgrandon from comment #22)
> > Have we tried the patch in comment 19 yet? I'm also wondering if we can get
> > around this by changing the selector a bit? Are we positive that we are
> > getting the right appWindow? We could also try something like:
> > 
> > locator = (By.CSS_SELECTOR, 'div.appWindow.active.homescreen iframe')
> 
> :mdas and/or :njpark, can we get help with trying the patch, first, and then
> -- if needed -- the other suggestion(s) here?  Thanks!

Adding extra sleep in line 722 didn't work, and when the locator value is changed, the script failed with nosuchelementfound exception, because strangely, when that change is added, the screen is in FTE, not showing the homescreen.  this is in start_b2g, and clearing the FTE flag happens in cleanup_gaia method, which gets invoked after start_b2g is completed.

Also, test_homescreen_delete_app.py fails consistently now, as well as test_cleanup_gaia.py
Flags: needinfo?(npark)
(In reply to npark from comment #25)
> (In reply to Stephen Donner [:stephend] from comment #23)
> > (In reply to Kevin Grandon :kgrandon from comment #22)
> > > Have we tried the patch in comment 19 yet? I'm also wondering if we can get
> > > around this by changing the selector a bit? Are we positive that we are
> > > getting the right appWindow? We could also try something like:
> > > 
> > > locator = (By.CSS_SELECTOR, 'div.appWindow.active.homescreen iframe')
> > 
> > :mdas and/or :njpark, can we get help with trying the patch, first, and then
> > -- if needed -- the other suggestion(s) here?  Thanks!
> 
> Adding extra sleep in line 722 didn't work, and when the locator value is
> changed, the script failed with nosuchelementfound exception, because
> strangely, when that change is added, the screen is in FTE, not showing the
> homescreen.  this is in start_b2g, and clearing the FTE flag happens in
> cleanup_gaia method, which gets invoked after start_b2g is completed.
> 
> Also, test_homescreen_delete_app.py fails consistently now, as well as
> test_cleanup_gaia.py

Oh, comment 24 explains exactly why it happens.
It turns out you need to add sleep of minimum 25 seconds here:
 https://github.com/mozilla-b2g/gaia/blob/master/tests/python/gaia-ui-tests/gaiatest/gaia_test.py#L717.

I measured the time it takes between when self.marionette.start_session() gets invoked until FTE is shown, by manually pausing the script by adding raw_input"Pausing Here" command after start_session().  on today's latest master, it takes 25-26 seconds before it shows FTE screen.

The odd thing is, during the bootup, it shows "based on mozilla technology" logo twice.  first with black background, then a moment of dark blank screen, then with white background.  After that, FTE shows up.  I thought it used to just show once.  

When I added sleep of 27 seconds, it was passing consistently.  This is after removing sleep(10) inside the setUp() in gaia_test.py for 990580.

There is a possible race condition here, probably with something that either start_b2g or setUp() invokes.
Hey Alive - We were investigating this in SF today, but were wondering if you could take over as it's getting late in the day.

Basically there is a race condition between FTU, app window manager/factory, and the runapp CLI command. This is only really visibile when running automated tests, or using the runapp CLI command. Normally I'd say that the tests should just be fixed to use the interface like a normal human, but due to the fact that we have this documented CLI command, we should ensure it works properly.

I started working on a patch here, but it's not quite finished, and I'm sure you might have some better ways of doing things: https://github.com/mozilla-b2g/gaia/pull/18509

Please let me know if you can take this, otherwise I might give it another shot again tomorrow. Thanks!
Flags: needinfo?(alive)
Thanks Kevin and Alive.. Doing this in Gaia will help when we move the js tests onto device, too!

I do feel like something has regressed here in the last 2-3 weeks or so, but hard to see what it is. If we can shore it up via a patch like this, I'd be happy.
(In reply to Zac C (:zac) from comment #24)

> One thing perhaps worth trying is to wait for the splash screen element to
> have cleared, then wait for the appWindow.active.render.

I just experimented with this and the #os-logo element behaves like this; 1. not in dom, 2. in dom 3. 'hide' class attached 4. deleted from dom.
Waiting for that is a recipe for unreliability and #1 and #4 match each other, so this option is out.
I am trapped by other bugs today. Will take this tomorrow.
I think this was resolved by the resolution of bug 990580.

Kevin/Alive do you think there is more work here that can be done long term? should we break it out into another bug, or re-title this one?
If it's working for you guys I'm fine with closing this for now. I'm not convinced that all of the edge cases have been solved, but if there are remaining issues please reopen this bug.
Status: REOPENED → RESOLVED
Closed: 6 years ago6 years ago
Flags: needinfo?(alive)
Resolution: --- → WORKSFORME
My money's on bug 992752 - the timeframe, the changes, and the problem as described seem to fit us, precisely, here.
Product: Core → Core Graveyard
You need to log in before you can comment on or make changes to this bug.