Closed Bug 990580 Opened 6 years ago Closed 6 years ago

Tests are failing; app freezes after being launched

Categories

(Firefox OS Graveyard :: General, defect, critical)

ARM
Gonk (Firefox OS)
defect
Not set
critical

Tracking

(blocking-b2g:2.0+)

RESOLVED WORKSFORME
blocking-b2g 2.0+

People

(Reporter: RobertC, Unassigned)

References

Details

(Keywords: qablocker, regression)

Attachments

(8 files, 3 obsolete files)

Tests on Jenkins are failing because of a timeout in the app launch method. The timeout occurs when trying to switch to the frame of the newly opened app.

This issue is not reproduced locally with manual or automated testing.

Stacktrace:
06:42:52 ERROR: None
06:42:52 ----------------------------------------------------------------------
06:42:52 Traceback (most recent call last):
06:42:52   File "/var/jenkins/workspace/b2g.hamachi.mozilla-central.ui.2/.env/local/lib/python2.7/site-packages/marionette_client-0.7.5-py2.7.egg/marionette/marionette_test.py", line 163, in run
06:42:52     testMethod()
06:42:52   File "/var/jenkins/workspace/b2g.hamachi.mozilla-central.ui.2/tests/python/gaia-ui-tests/gaiatest/tests/functional/keyboard/test_keyboard.py", line 19, in test_keyboard_basic
06:42:52     contacts_app.launch()
06:42:52   File "/var/jenkins/workspace/b2g.hamachi.mozilla-central.ui.2/tests/python/gaia-ui-tests/gaiatest/apps/contacts/app.py", line 32, in launch
06:42:52     Base.launch(self)
06:42:52   File "/var/jenkins/workspace/b2g.hamachi.mozilla-central.ui.2/tests/python/gaia-ui-tests/gaiatest/apps/base.py", line 25, in launch
06:42:52     self.app = self.apps.launch(self.name, launch_timeout=launch_timeout)
06:42:52   File "/var/jenkins/workspace/b2g.hamachi.mozilla-central.ui.2/tests/python/gaia-ui-tests/gaiatest/gaia_test.py", line 60, in launch
06:42:52     self.marionette.switch_to_frame(app.frame_id)
06:42:52   File "/var/jenkins/workspace/b2g.hamachi.mozilla-central.ui.2/.env/local/lib/python2.7/site-packages/marionette_client-0.7.5-py2.7.egg/marionette/marionette.py", line 895, in switch_to_frame
06:42:52     response = self._send_message('switchToFrame', 'ok', element=frame.id, focus=focus)
06:42:52   File "/var/jenkins/workspace/b2g.hamachi.mozilla-central.ui.2/.env/local/lib/python2.7/site-packages/marionette_client-0.7.5-py2.7.egg/marionette/marionette.py", line 604, in _send_message
06:42:52     "Connection timed out", status=ErrorCodes.TIMEOUT)
06:42:52 TimeoutException: TimeoutException: Connection timed out
06:42:52 
06:42:52 
06:42:52 Most recent errors/exceptions are:
06:42:52 
06:42:52 04-01 06:36:45.129 E/GeckoConsole( 6688): Content JS ERROR at app://homescreen.gaiamobile.org/gaia_build_defer_index.js:397 in loadSVConfFileError: Failed parsing singleVariant configuration file [js/singlevariantconf.json]:  [Exception... "File error: Not found"  nsresult: "0x80520012 (NS_ERROR_FILE_NOT_FOUND)"  location: "JS frame :: app://homescreen.gaiamobile.org/gaia_build_defer_index.js :: loadFile :: line 389"  data: no]
06:42:52 04-01 06:36:47.209 E/GeckoConsole( 6688): Content JS ERROR at app://homescreen.gaiamobile.org/gaia_build_defer_index.js:60 in retrieve: Got an exception when trying to load icon "app://music2.gaiamobile.org/style/icons/60/Music.png +" falling back to cached icon. Exception is: 
06:42:52 TEST-UNEXPECTED-FAIL | test_keyboard.py test_keyboard.TestKeyboard.test_keyboard_basic | The B2G process has restarted after crashing during the tests so Marionette can't respond due to either a Gecko, Gaia or Marionette error. Above, the 5 most recent errors are listed. Check logcat for all errors if these errors are not the cause of the failure.

Latest build on which it is reproduced:
Gaia      874fe42b82e8d819d592690e74db91c07179e68c
Gecko     https://hg.mozilla.org/mozilla-central/rev/1417d180a1d8
BuildID   20140401040202
Version   31.0a1
ro.build.version.incremental=eng.tclxa.20131223.163538
ro.build.date=Mon Dec 23 16:36:04 CST 2013

Latest test run where the issue occured:
http://selenium.qa.mtv2.mozilla.com:8080/view/B2G%20Hamachi/job/b2g.hamachi.mozilla-central.ui.2/255/console
This issue first appeared in yesterdays build (31 March).

Showing version details
application_buildid: 20140331040200
application_changeset: d8e8f13bd4ae
application_name: B2G
application_repository: https://hg.mozilla.org/mozilla-central
application_version: 31.0a1
build_changeset: 1ad48c4be51b279f7f63c1a13025b52fe087d231
device_firmware_date: 1387433095
device_firmware_version_incremental: 324
device_firmware_version_release: 4.0.4
device_id: msm7627a
gaia_changeset: 26839cb46f856d610b192f5655a8c38a6bfe0829
gaia_date: 1396251714
gecko_changeset: 34c6e4261eb036cb6050d5b1a73cd9bc4f5f6251
platform_buildid: 20140331040200
platform_changeset: d8e8f13bd4ae
platform_repository: https://hg.mozilla.org/mozilla-central
Summary: Tests are failing on Jenkins with timeout when trying to launch apps → Tests are failing on Jenkins with timeout when trying to switch to the app frame
It seems the error is not limited to the launch method. It appears when calling self.marionette.switch_to_frame(app.frame_id) regardless of the method it's called from.
Adding qawanted  as we can not reproduce this locally with automation and manual testing.
Keywords: qawanted
(In reply to Florin Strugariu [:Bebe] from comment #3)
> Adding qawanted  as we can not reproduce this locally with automation and
> manual testing.

We really only need to do qawanted in cases where there hasn't been a manual investigation already. From what I can tell from the comments above, that already happened here.
Keywords: qawanted
Andrei, team: when you see timeouts like this, can you start attaching debugging logs, if we suspect a crash?

See https://wiki.mozilla.org/B2G/QA/Tips_And_Tricks#Debugging_OOMs for info; thanks!
From the error message:

(In reply to Robert Chira [:RobertC] from comment #0)
> test_keyboard.TestKeyboard.test_keyboard_basic | The B2G process has
> restarted after crashing during the tests so Marionette can't respond due to
> either a Gecko, Gaia or Marionette error. Above, the 5 most recent errors
> are listed. Check logcat for all errors if these errors are not the cause of
> the failure.


This means that during this test, b2g crashed and restarted, so Marionette can't respond since the server was restarted.


The crash might be due to the errors that was printed out:

"04-01 06:36:45.129 E/GeckoConsole( 6688): Content JS ERROR at app://homescreen.gaiamobile.org/gaia_build_defer_index.js:397 in loadSVConfFileError: Failed parsing singleVariant configuration file [js/singlevariantconf.json]:  [Exception... "File error: Not found"  nsresult: "0x80520012 (NS_ERROR_FILE_NOT_FOUND)"  location: "JS frame :: app://homescreen.gaiamobile.org/gaia_build_defer_index.js :: loadFile :: line 389"  data: no]
06:42:52 04-01 06:36:47.209 E/GeckoConsole( 6688): Content JS ERROR at app://homescreen.gaiamobile.org/gaia_build_defer_index.js:60 in retrieve: Got an exception when trying to load icon "app://music2.gaiamobile.org/style/icons/60/Music.png +" falling back to cached icon. Exception is: "

But more information about the crash would be in the logcat, or in the minidumps if you have that enabled.
I found:

Bug 918982 - FILE_NOT_FOUND error fired in logcat looking for singleVariant configuration file (singlevariantconf.json)
Bug 980814 - Fix test_system_message.py, JavaScript Error: "NS_ERROR_ILLEGAL_VALUE: Component returned failure code: 0x80070057 (NS_ERROR_ILLEGAL_VALUE)

That look to contain the same error:
 E/GeckoConsole( 2241): Content JS ERROR at app://homescreen.gaiamobile.org/gaia_build_defer_index.js:397 in loadSVConfFileError: Failed parsing singleVariant configuration file [js/singlevariantconf.json]:  [Exception... "File error: Not found"  nsresult: "0x80520012 (NS_ERROR_FILE_NOT_FOUND)"  location: "JS frame :: app://homescreen.gaiamobile.org/gaia_build_defer_index.js :: loadFile :: line 389"  data: no]

Also when we reboot the phone we see the same issue:
E/GeckoConsole(  371): Content JS ERROR at app://homescreen.gaiamobile.org/gaia_build_defer_index.js:397 in loadSVConfFileError: Failed parsing singleVariant configuration file [js/singlevariantconf.json]:  [Exception... "File error: Not found"  nsresult: "0x80520012 (NS_ERROR_FILE_NOT_FOUND)"  location: "JS frame :: app://homescreen.gaiamobile.org/gaia_build_defer_index.js :: loadFile :: line 389"  data: no]



Manual steps:
1. flash latest build
2. open adb log cat and watch for the error


BuildID:

Gaia      0e974ff33ba47f3d1e59df1e0ad534f1bbe3ef8a
Gecko     https://hg.mozilla.org/mozilla-central/rev/91be2828f17e
BuildID   20140403040201
Version   31.0a1
ro.build.version.incremental=324
ro.build.date=Thu Dec 19 14:04:55 CST 2013
at line 231 you can find the error
This issue is reproducible in today’s Jenkins build.

Gaia      0e974ff33ba47f3d1e59df1e0ad534f1bbe3ef8a
Gecko     https://hg.mozilla.org/mozilla-central/rev/91be2828f17e
BuildID   20140403040201
Version   31.0a1
ro.build.version.incremental=324
ro.build.date=Thu Dec 19 14:04:55 CST 2013


We can't reproduce the test failure locally.
I actually managed to replicate this when running test_url_keyboard.TestUrlKeyboard locally.
When I saw it failed, the UI Tests app failed to launch. After being tapped the app opening animation did not finish.

Hamachi build:
Gaia      d9a574284d672f532f7c562a091bb01f531202b1                         
Gecko     https://hg.mozilla.org/mozilla-central/rev/6c924a018540          
BuildID   20140404040204                                                   
Version   31.0a1
(In reply to [:AndreiH] from comment #10)
> I actually managed to replicate this when running
> test_url_keyboard.TestUrlKeyboard locally.
> When I saw it failed, the UI Tests app failed to launch. After being tapped
> the app opening animation did not finish.
> 
> Hamachi build:
> Gaia      d9a574284d672f532f7c562a091bb01f531202b1                         
> Gecko     https://hg.mozilla.org/mozilla-central/rev/6c924a018540          
> BuildID   20140404040204                                                   
> Version   31.0a1

Thanks for isolating that; in the future, as requested in comment 5, can you get a minidump/crash log attached, here?  See https://wiki.mozilla.org/B2G/QA/Tips_And_Tricks#Debugging_OOMs
Have we determined here that this is a gecko bug or a bug in the test?
Flags: needinfo?(stephen.donner)
(In reply to Jason Smith [:jsmith] from comment #12)
> Have we determined here that this is a gecko bug or a bug in the test?

Looks like a core B2G process/Gecko crasher, to me.  Bebe, please work with the team to get this further debugged (look at http://selenium.qa.mtv2.mozilla.com:8080/job/b2g.hamachi.mozilla-central.tinderbox.ui.receive_call/42/ and http://selenium.qa.mtv2.mozilla.com:8080/job/b2g.hamachi.mozilla-central.unittests/370/) for smaller tests to help reproduce.  Malini/Gregor, can you take a look with the information we'll provide?  Thx.
Flags: needinfo?(stephen.donner) → needinfo?(florin.strugariu)
The problem here is not the step that's failing. It's the one just before that.

Strongly suspect this from the logcat:
I/Gecko   (  682): Security problem: Content process does not have `undefined'.  It will be killed.

Some process is being killed and thus the following marionette client step fails.
It would really be helpful to have a live video on how the tests run and what happens when they fail.
I thought to run the whole tests suite locally, thinking that 1 test will not reproduce this.
In my run I caught "test_camera_multiple_shots.py" crashed.
Also attached OOM debugging before the crash.

Screenshot: http://postimg.org/image/ko59q81sp/
Attached file locally run logcat.txt
When running the whole test suite I could see some failures locally with:

I/Gecko   (23169): 1396968219722	Marionette	INFO	sendToClient: {"from":"0","value":{"ELEMENT":"{1b27510b-b459-4034-b481-8eab8a079274}"}}, {8d1c8064-c11c-4157-9a50-205e42696fa0}, {8d1c8064-c11c-4157-9a50-205e42696fa0}
I/Gecko   (23169): 1396968219736	Marionette	INFO	sendToClient: {"from":"0","value":{"ELEMENT":"{1b27510b-b459-4034-b481-8eab8a079274}"}}, {0b42070b-9090-43f2-8363-f9511cfe55ed}, {0b42070b-9090-43f2-8363-f9511cfe55ed}
I/Gecko   (23169): 1396968219746	Marionette	DEBUG	Got request: switchToFrame, data: {"to":"0","sessionId":{"rotatable":true,"browserVersion":"31.0a1","takesScreenshot":true,"appBuildId":"20140408040204","cssSelectorsEnabled":true,"javascriptEnabled":true,"XULappId":"{3c2e2abc-06d4-11e1-ac3b-374f68613e61}","secureSsl":false,"platform":"ANDROID","browserName":"B2G","version":"31.0a1","device":"msm7627a","b2g":true,"nativeEvents":false,"platformVersion":"31.0a1","takesElementScreenshot":true,"platformName":"ANDROID","handlesAlerts":false},"name":"switchToFrame","parameters":{"focus":true,"element":"{1b27510b-b459-4034-b481-8eab8a079274}"}}, id: {41470f06-d969-48bb-a572-9d2d3592a703}
I/Gecko   (23169): 1396968219751	Marionette	INFO	Switched to frame: {"frameValue":"{1b27510b-b459-4034-b481-8eab8a079274}"}
I/Gecko   (23374): ###################################### forms.js loaded
I/Gecko   (23169): 1396968219752	Marionette	INFO	trying remote frame 0
I/Gecko   (23169): 1396968219767	Marionette	INFO	frame-manager load script: [object ChromeMessageSender]
I/Gecko   (23374): ############################### browserElementPanning.js loaded
I/Gecko   (23374): ######################## BrowserElementChildPreload.js loaded
I/Gecko   (23169): Security problem: Content process does not have `undefined'.  It will be killed.
I/Gecko   (23169): [Parent 23169] WARNING: waitpid failed pid:23374 errno:10: file ../../../gecko/ipc/chromium/src/base/process_util_posix.cc, line 261
I/Gecko   (23169): 
I/Gecko   (23169): ###!!! [Parent][MessageChannel] Error: Channel error: cannot send/recv
I/Gecko   (23169): 
I/Gecko   (23169): ############ ErrorPage.js
E/GeckoConsole(23169): [JavaScript Error: "IndexedDB UnknownErr: IDBTransaction.cpp:864"]
I/Gonk    (23169): Setting nice for pid 23286 to 18
I/Gonk    (23169): Changed nice for pid 23286 from 18 to 18.
I/Gecko   (23169): 
I/Gecko   (23169): ###!!! [Parent][MessageChannel] Error: Channel error: cannot send/recv
This is the logcat of 


stack trace:
Traceback (most recent call last):
File "/var/jenkins/workspace/b2g.hamachi.mozilla-central.ui.2/.env/local/lib/python2.7/site-packages/marionette_client-0.7.5-py2.7.egg/marionette/marionette_test.py", line 163, in run
testMethod()
File "/var/jenkins/workspace/b2g.hamachi.mozilla-central.ui.2/tests/python/gaia-ui-tests/gaiatest/tests/functional/gallery/test_gallery_view.py", line 23, in test_gallery_view
gallery.launch()
File "/var/jenkins/workspace/b2g.hamachi.mozilla-central.ui.2/tests/python/gaia-ui-tests/gaiatest/apps/gallery/app.py", line 24, in launch
Base.launch(self)
File "/var/jenkins/workspace/b2g.hamachi.mozilla-central.ui.2/tests/python/gaia-ui-tests/gaiatest/apps/base.py", line 25, in launch
self.app = self.apps.launch(self.name, launch_timeout=launch_timeout)
File "/var/jenkins/workspace/b2g.hamachi.mozilla-central.ui.2/tests/python/gaia-ui-tests/gaiatest/gaia_test.py", line 60, in launch
self.marionette.switch_to_frame(app.frame_id)
File "/var/jenkins/workspace/b2g.hamachi.mozilla-central.ui.2/.env/local/lib/python2.7/site-packages/marionette_client-0.7.5-py2.7.egg/marionette/marionette.py", line 895, in switch_to_frame
response = self._send_message('switchToFrame', 'ok', element=frame.id, focus=focus)
File "/var/jenkins/workspace/b2g.hamachi.mozilla-central.ui.2/.env/local/lib/python2.7/site-packages/marionette_client-0.7.5-py2.7.egg/marionette/marionette.py", line 604, in _send_message
"Connection timed out", status=ErrorCodes.TIMEOUT)
TimeoutException: TimeoutException: Connection timed out


Most recent errors/exceptions are:

04-08 05:42:18.149 E/GeckoConsole( 2996): Content JS ERROR at app://homescreen.gaiamobile.org/gaia_build_defer_index.js:420 in loadSVConfFileError: Failed parsing singleVariant configuration file [js/singlevariantconf.json]: [Exception... "File error: Not found" nsresult: "0x80520012 (NS_ERROR_FILE_NOT_FOUND)" location: "JS frame :: app://homescreen.gaiamobile.org/gaia_build_defer_index.js :: loadFile :: line 412" data: no]
04-08 05:42:20.459 E/GeckoConsole( 2996): Content JS ERROR at app://homescreen.gaiamobile.org/gaia_build_defer_index.js:81 in retrieve: Got an exception when trying to load icon "app://music2.gaiamobile.org/style/icons/60/Music.png +" falling back to cached icon. Exception is:
The B2G process has restarted after crashing during the tests so Marionette can't respond due to either a Gecko, Gaia or Marionette error. Above, the 5 most recent errors are listed. Check logcat for all errors if these errors are not the cause of the failure.



Build ID:
Showing version details
05:34:44 application_buildid: 20140407160203
05:34:44 application_changeset: d8b2e3738785
05:34:44 application_name: B2G
05:34:44 application_repository: https://hg.mozilla.org/mozilla-central
05:34:44 application_version: 31.0a1
05:34:44 build_changeset: 1ad48c4be51b279f7f63c1a13025b52fe087d231
05:34:44 device_firmware_date: 1387433095
05:34:44 device_firmware_version_incremental: 324
05:34:44 device_firmware_version_release: 4.0.4
05:34:44 device_id: msm7627a
05:34:44 gaia_changeset: 3796021a52948a2997d9e449ab2c83279d3e1074
05:34:44 gaia_date: 1396856401
05:34:44 gecko_changeset: 63e9fff3bca167b1ab0fce67f3ae122fe82b84ba
05:34:44 platform_buildid: 20140407160203
05:34:44 platform_changeset: d8b2e3738785
05:34:44 platform_repository: https://hg.mozilla.org/mozilla-central
Thomas, would you be able to help us see if we're running into a B2G process/kernel or socket crash/error, here?
Flags: needinfo?(tzimmermann)
QA Blocker - this is causing numerous problems in our UI automation suite right now. We need to get this fixed.
blocking-b2g: --- → 1.5?
Component: Gaia::UI Tests → General
Zac - Can you find out the last time we didn't see this failure & first time it failed? I'm trying to pinpoint a regression range to help diagnose this.
Flags: needinfo?(zcampbell)
I *think* this regressed between the 19th and the 20th.

1) 19th build (GOOD): http://selenium.qa.mtv2.mozilla.com:8080/view/B2G%20Hamachi/job/b2g.hamachi.mozilla-central.unittests/291/testReport/ (might expire)

Build failures were: test_cleanup_gaia and test_connect_and_forget_all_networks (both of which were/are due to Wi-Fi, ala bug 987760 

2) 20th build (BAD): http://selenium.qa.mtv2.mozilla.com:8080/view/B2G%20Hamachi/job/b2g.hamachi.mozilla-central.unittests/292/testReport/ (might expire)

Build failures were: test_cold_launch (Clock failed to launch)
I'm afraid I'm significantly hampered by being unable to reach the MV Jenkins server from London.

I can reach Tpe job server and unfortunately there the data only goes back 1 week. The oldest failing build there is:
Gaia      4cf7ec04b82320a1ff1f825220204f454f951a87
Gecko     https://hg.mozilla.org/mozilla-central/rev/1c23863d8dda
BuildID   20140330160201
Version   31.0a1

which is 2 builds older than the one from comment #0.

The last successful report filed to qa-b2g was on Friday before that:
Gaia      9da1b9c11bf518bce882be305ae121c44c5d1e05
Gecko     https://hg.mozilla.org/mozilla-central/rev/9afe2a1145bd
BuildID   20140327040202
Version   31.0a1

Which gives a very crude regression range to start with.

We (mostly Bebe and I) think this is more associated with executing javascript, or launching an app (possibly WindowManager) rather than a problem with Marionette, which the bug title suggests, so you could start looking at commits around that. We don't think it is a Marionette problem but we could be wrong!

Stephen might be able to narrow this range further using tinderbox test runs.
Flags: needinfo?(zcampbell)
Hmm - as Zac suggests, two of these logcats show the security error firing. When that security error shows up in the logcat, that will end killing the app out right.
(In reply to Jason Smith [:jsmith] from comment #25)
> Hmm - as Zac suggests, two of these logcats show the security error firing.
> When that security error shows up in the logcat, that will end killing the
> app out right.

Which typically is usually due to a DOM bug.

Andrew - Is there someone you could suggest here that could help diagnose the security errors we are seeing firing here? This is causing major problems in a lot of tests right now running reliably.
Flags: needinfo?(overholt)
(In reply to Stephen Donner [:stephend] from comment #19)
> Thomas, would you be able to help us see if we're running into a B2G
> process/kernel or socket crash/error, here?

I can probably help a bit, but I won't promise anything. I've been chasing these Marionette failures before, but often without much luck. Is the issue reproducible locally by running test_url_keyboard.TestUrlKeyboard?
Flags: needinfo?(tzimmermann)
I grep'ed for this error message from comment 15 and found it in

  'dom/ipc/AppProcessChecker.cpp|, near line 118.

That's an internal function, which gets called by |AssertAppProcessPermission|, |AssertAppProcessManifestURL|, and |AssertAppHasPermission|. Grep'ing for these functions and sorting out impossible results, I get

  content/base/src/nsFrameLoader.cpp:2303:  return AssertAppProcessPermission(GetRemoteBrowser(),
  content/base/src/nsFrameLoader.cpp:2310:  return AssertAppProcessManifestURL(GetRemoteBrowser(),
  dom/devicestorage/DeviceStorageRequestParent.cpp:365:  if (!AssertAppProcessPermission(aParent, permissionName.get())) {
  dom/filesystem/FileSystemRequestParent.cpp:68:    if (!AssertAppProcessPermission(aParent, permissionName.get())) {
  dom/indexedDB/ipc/IndexedDBParent.cpp:146:         !AssertAppProcessPermission(mManagerContent, fullPermission.get())) ||
  dom/indexedDB/ipc/IndexedDBParent.cpp:148:         !AssertAppProcessPermission(mManagerTab, fullPermission.get()))) {
  dom/ipc/ContentParent.cpp:3221:  return AssertAppProcessPermission(this, NS_ConvertUTF16toUTF8(aPermission).get());
  dom/ipc/ContentParent.cpp:3227:  return AssertAppProcessManifestURL(this, NS_ConvertUTF16toUTF8(aManifestURL).get());

Can we get a stack trace?
Taipei CI is seeing the same problem but in the test failure stacktrace is consistently showing this error:

04-09 10:56:18.089 E/GeckoConsole( 7601): [JavaScript Error: "A promise chain failed to handle a rejection.
04-09 10:56:18.089 E/GeckoConsole( 7601): Full Message: TypeError: worker is null

However Taipei CI server does not capture the full logcat so I cannot correlate that with the MV CI server logs.
I managed to replicate this locally when testing "test_email_keyboard.TestEmailKeyboard" with repeat 30.
UI Tests app just stayed there after launching it, but not opening.

Traceback: https://pastebin.mozilla.org/4780229
Logcat: https://pastebin.mozilla.org/4780232
(In reply to [:AndreiH] from comment #30)
> I managed to replicate this locally when testing
> "test_email_keyboard.TestEmailKeyboard" with repeat 30.
> UI Tests app just stayed there after launching it, but not opening.
> 
> Traceback: https://pastebin.mozilla.org/4780229
> Logcat: https://pastebin.mozilla.org/4780232

The logcat is incomplete; can you add it as an attachment so it will live longer?
(In reply to Thomas Zimmermann [:tzimmermann] [:tdz] from comment #28)
 
> Can we get a stack trace?

We'd be happy to help you help us; I'm not sure which stack trace, though -- can you clarify on what's needed?  FWIW, we're running Engineering builds (with Marionette enabled), but they're not debug AFAIK, otherwise.
(In reply to Zac C (:zac) from comment #15)
> The problem here is not the step that's failing. It's the one just before
> that.
> 
> Strongly suspect this from the logcat:
> I/Gecko   (  682): Security problem: Content process does not have
> `undefined'.  It will be killed.
> 
> Some process is being killed and thus the following marionette client step
> fails.

This means something is calling AssertAppProcess() passing in "undefined" as the last argument.  Looking at the consumers in Gecko, I can't find anywhere that we pass "undefined" down to this function.  This function however is indirectly exposed to JS (here is one example, not saying this call site has necessarily anything to do with this bug: http://mxr.mozilla.org/mozilla-central/source/dom/downloads/src/DownloadsAPI.jsm#145)

Now given the above, here's my theory.  I bet some JS code is doing something like:

  messageMananger.assertPermission(foo.bar);

where foo doesn't have a bar property.  That creates a JS undefined value, which stringifies to "undefined".

This means we need to know the JS caller.  The easiest way I can think of to do that would be to attach gdb to the child process, reproduce the bug so that this crash happens, and then do: |call DumpJSStack()| in gdb.
Instructions for attaching gdb to the child process: <https://developer.mozilla.org/en-US/Firefox_OS/Debugging/Debugging_B2G_using_gdb#Debugging_out-of-process_tasks>

You basically need to run b2g-ps, get the pid for the process that you're interested in (looking at the name on the first column) and then do ./run-gdb attach <pid>.
(In reply to :Ehsan Akhgari (lagging on bugmail, needinfo? me!) from comment #35)
> Instructions for attaching gdb to the child process:
> <https://developer.mozilla.org/en-US/Firefox_OS/Debugging/
> Debugging_B2G_using_gdb#Debugging_out-of-process_tasks>
> 
> You basically need to run b2g-ps, get the pid for the process that you're
> interested in (looking at the name on the first column) and then do
> ./run-gdb attach <pid>.

:retornam, would you be able to try to get a stack for us, using one of the failing unittests?  (test_cleanup_gaia, run repeatedly until it fails, looks like a good, lean candidate).  Thanks either way.
Flags: needinfo?(mozbugs.retornam)
Severity: normal → critical
(Please try to get a C++ stack using the gdb bt command as well, that will also be useful.  Thanks!)
(In reply to :Ehsan Akhgari (lagging on bugmail, needinfo? me!) from comment #35)
> Instructions for attaching gdb to the child process:
> <https://developer.mozilla.org/en-US/Firefox_OS/Debugging/
> Debugging_B2G_using_gdb#Debugging_out-of-process_tasks>
> 
> You basically need to run b2g-ps, get the pid for the process that you're
> interested in (looking at the name on the first column) and then do
> ./run-gdb attach <pid>.

Err, sorry, make that the parent process.
Flags: needinfo?(overholt)
(In reply to Stephen Donner [:stephend] from comment #36)
> (In reply to :Ehsan Akhgari (lagging on bugmail, needinfo? me!) from comment
> #35)
> > Instructions for attaching gdb to the child process:
> > <https://developer.mozilla.org/en-US/Firefox_OS/Debugging/
> > Debugging_B2G_using_gdb#Debugging_out-of-process_tasks>
> > 
> > You basically need to run b2g-ps, get the pid for the process that you're
> > interested in (looking at the name on the first column) and then do
> > ./run-gdb attach <pid>.
> 
> :retornam, would you be able to try to get a stack for us, using one of the
> failing unittests?  (test_cleanup_gaia, run repeatedly until it fails, looks
> like a good, lean candidate).  Thanks either way.

After attaching the debugger, enter

  break AppProcessChecker.cpp:120

before continuing the b2g process. This will make it stop right after the failed permission check happened. Entering

  bt

will give you the stack trace.
(In reply to :Ehsan Akhgari (lagging on bugmail, needinfo? me!) from comment #34)
> (In reply to Zac C (:zac) from comment #15)
> > The problem here is not the step that's failing. It's the one just before
> > that.
> > 
> > Strongly suspect this from the logcat:
> > I/Gecko   (  682): Security problem: Content process does not have
> > `undefined'.  It will be killed.
> > 
> > Some process is being killed and thus the following marionette client step
> > fails.
> 
> This means something is calling AssertAppProcess() passing in "undefined" as
> the last argument.  Looking at the consumers in Gecko, I can't find anywhere
> that we pass "undefined" down to this function.  This function however is
> indirectly exposed to JS (here is one example, not saying this call site has
> necessarily anything to do with this bug:
> http://mxr.mozilla.org/mozilla-central/source/dom/downloads/src/DownloadsAPI.
> jsm#145)
> 
> Now given the above, here's my theory.  I bet some JS code is doing
> something like:
> 
>   messageMananger.assertPermission(foo.bar);
> 
> where foo doesn't have a bar property.  That creates a JS undefined value,
> which stringifies to "undefined".
> 
> This means we need to know the JS caller.  The easiest way I can think of to
> do that would be to attach gdb to the child process, reproduce the bug so
> that this crash happens, and then do: |call DumpJSStack()| in gdb.

Now I can see this logcat line in passing tests so I don't think it is a problem anymore. Or at least, not this problem!
(In reply to Zac C (:zac) from comment #40)
> Now I can see this logcat line in passing tests so I don't think it is a
> problem anymore. Or at least, not this problem!

That's actually good, because I have been able to reproduce a crash with test_url_keyboard.py, but wasn't seeing the security error mentioned above. I am seeing the following error though, which I am currently investigating.

E/GeckoConsole(  581): [JavaScript Error: "A promise chain failed to handle a rejection.
E/GeckoConsole(  581):
E/GeckoConsole(  581): Date: Thu Apr 10 2014 12:03:05 GMT-0400 (UTC)
E/GeckoConsole(  581): Full Message: TypeError: worker is null
E/GeckoConsole(  581): Full Stack: post/</<@resource://gre/modules/osfile/osfile_async_front.jsm:417:11
E/GeckoConsole(  581): TaskImpl_run@resource://gre/modules/Task.jsm:282:1
E/GeckoConsole(  581): TaskImpl@resource://gre/modules/Task.jsm:247:3
E/GeckoConsole(  581): createAsyncFunction/asyncFunction@resource://gre/modules/Task.jsm:224:7
E/GeckoConsole(  581): Task_spawn@resource://gre/modules/Task.jsm:139:5
E/GeckoConsole(  581): post/<@resource://gre/modules/osfile/osfile_async_front.jsm:402:28
E/GeckoConsole(  581): Handler.prototype.process@resource://gre/modules/Promise.jsm -> resource://gre/modules/Promise-backend.js:748:11
E/GeckoConsole(  581): this.PromiseWalker.walkerLoop@resource://gre/modules/Promise.jsm -> resource://gre/modules/Promise-backend.js:627:7
E/GeckoConsole(  581): " {file: "resource://gre/modules/osfile/osfile_async_front.jsm" line: 417 column: 0 source: "417"}]
That looks like a separate thing Michael.

Just this afternoon I've had some luck replicating this, I can now hit it over 50% of the time. It seems to be a bit of a race between the startup of the device and the launching of the app.

I might have to call in Alive for some help soon :)
(In reply to Michael Henretty [:mhenretty] from comment #41)
> (In reply to Zac C (:zac) from comment #40)
> > Now I can see this logcat line in passing tests so I don't think it is a
> > problem anymore. Or at least, not this problem!
> 
> That's actually good, because I have been able to reproduce a crash with
> test_url_keyboard.py, but wasn't seeing the security error mentioned above.
> I am seeing the following error though, which I am currently investigating.
> 
> E/GeckoConsole(  581): [JavaScript Error: "A promise chain failed to handle
> a rejection.
> E/GeckoConsole(  581):
> E/GeckoConsole(  581): Date: Thu Apr 10 2014 12:03:05 GMT-0400 (UTC)
> E/GeckoConsole(  581): Full Message: TypeError: worker is null
> E/GeckoConsole(  581): Full Stack:
> post/</<@resource://gre/modules/osfile/osfile_async_front.jsm:417:11
> E/GeckoConsole(  581): TaskImpl_run@resource://gre/modules/Task.jsm:282:1
> E/GeckoConsole(  581): TaskImpl@resource://gre/modules/Task.jsm:247:3
> E/GeckoConsole(  581):
> createAsyncFunction/asyncFunction@resource://gre/modules/Task.jsm:224:7
> E/GeckoConsole(  581): Task_spawn@resource://gre/modules/Task.jsm:139:5
> E/GeckoConsole(  581):
> post/<@resource://gre/modules/osfile/osfile_async_front.jsm:402:28
> E/GeckoConsole(  581):
> Handler.prototype.process@resource://gre/modules/Promise.jsm ->
> resource://gre/modules/Promise-backend.js:748:11
> E/GeckoConsole(  581):
> this.PromiseWalker.walkerLoop@resource://gre/modules/Promise.jsm ->
> resource://gre/modules/Promise-backend.js:627:7
> E/GeckoConsole(  581): " {file:
> "resource://gre/modules/osfile/osfile_async_front.jsm" line: 417 column: 0
> source: "417"}]

David can tell us what this means.
Flags: needinfo?(dteller)
Alive, to summarise this bug, in the UI-tests test suite we are having a problem with apps freezing when launched (using the atom which you have worked on with us before).

The app freezes in its launch state with the icon in the middle. The phone is still active, but no apps can be launched, they all fail to launch.

By building a new Gaia profile with NOFTU=1 I managed to get to replicate it >80% of the time. I've narrowed it down to an issue with the timing of the startup. If I had a sleep(10) at the start of the test before launching the app it works fine, 100% passes.

I am thinking that the AppWindowManager or something is not fully initialised before we try to launch the app. Has something changed recently that regressed the startup of APIs? It seems like we need to add another wait into the startup of the b2g process to wait for something to be ready.
Flags: needinfo?(alive)
Attached file github pr (obsolete) —
A new app which is was introduced in bug 978785. The killAll method in the atoms was killing this app and for reasons to be honest I don't understand was causing apps to fail to launch.
I've modified killAll to not kill this app. (another option would be to remove killAll and only kill the FTU).

Alive, I'm still interested in your ni? if you can provide context to why this fix works in the context of my ni? !
Attachment #8404969 - Flags: review?(dave.hunt)
Comment on attachment 8404969 [details] [review]
github pr

This patch worked locally but not on other devices.

I think the patch is valid in another context so I'll file it there.
Attachment #8404969 - Attachment is obsolete: true
Attachment #8404969 - Flags: review?(dave.hunt)
I guess that patch just changed the timing subtly enough to allow the apps to launch.
Interesting. It means that the race condition I thought fixed in bug 961317 is not fully fixed yet. sigh.
I'll take a look tomorrow.

We should really activate OS.File's xpcshell tests on B2G to catch these errors earlier. Is there anything that prevents this?

(In reply to :Ehsan Akhgari (lagging on bugmail, needinfo? me!) from comment #43)
[...]
> > E/GeckoConsole(  581): [JavaScript Error: "A promise chain failed to handle
> > a rejection.
> > E/GeckoConsole(  581):
> > E/GeckoConsole(  581): Date: Thu Apr 10 2014 12:03:05 GMT-0400 (UTC)
> > E/GeckoConsole(  581): Full Message: TypeError: worker is null
> > E/GeckoConsole(  581): Full Stack:
> > post/</<@resource://gre/modules/osfile/osfile_async_front.jsm:417:11
[...]
(In reply to comment #48)
> We should really activate OS.File's xpcshell tests on B2G to catch these errors
> earlier. Is there anything that prevents this?

Please file a bug for that!  I guess the try server can answer that question.
(In reply to Michael Henretty [:mhenretty] from comment #41)
> I am seeing the following error though, which I am currently investigating.
> 
> E/GeckoConsole(  581): [JavaScript Error: "A promise chain failed to handle
> a rejection.
> E/GeckoConsole(  581):
> E/GeckoConsole(  581): Date: Thu Apr 10 2014 12:03:05 GMT-0400 (UTC)
> E/GeckoConsole(  581): Full Message: TypeError: worker is null
> E/GeckoConsole(  581): Full Stack:
> post/</<@resource://gre/modules/osfile/osfile_async_front.jsm:417:11
> E/GeckoConsole(  581): TaskImpl_run@resource://gre/modules/Task.jsm:282:1
> E/GeckoConsole(  581): TaskImpl@resource://gre/modules/Task.jsm:247:3
> E/GeckoConsole(  581):
> createAsyncFunction/asyncFunction@resource://gre/modules/Task.jsm:224:7
> E/GeckoConsole(  581): Task_spawn@resource://gre/modules/Task.jsm:139:5
> E/GeckoConsole(  581):
> post/<@resource://gre/modules/osfile/osfile_async_front.jsm:402:28
> E/GeckoConsole(  581):
> Handler.prototype.process@resource://gre/modules/Promise.jsm ->
> resource://gre/modules/Promise-backend.js:748:11
> E/GeckoConsole(  581):
> this.PromiseWalker.walkerLoop@resource://gre/modules/Promise.jsm ->
> resource://gre/modules/Promise-backend.js:627:7
> E/GeckoConsole(  581): " {file:
> "resource://gre/modules/osfile/osfile_async_front.jsm" line: 417 column: 0
> source: "417"}]

As Zac mentioned in comment 42, this is probably unrelated to the automation failure. Bug 995162 has been filed to track that.
See Also: → 995162
Flags: needinfo?(dteller)
Discussed with Clint & Stephen - here's what we're going to do:

* We'll implement the sleep hack to get the tests working again
* We'll do a manual bisection of engineering builds from 3/19 - 3/20 to find the regressing patch
I've put the workaround in to unblock Taipei on their Monday:
https://github.com/mozilla-b2g/gaia/commit/68d8e6897f326a8980c81da80bf8a76440ef3faf

By reducing the items in the gaiatest setUp we can trigger this bug more easily. Previously we could only hit it 1 in 30 which made bisecting difficult but with some tweaks we should be able to bisect more effectively.
(In reply to Zac C (:zac) from comment #44)
> Alive, to summarise this bug, in the UI-tests test suite we are having a
> problem with apps freezing when launched (using the atom which you have
> worked on with us before).
> 
> The app freezes in its launch state with the icon in the middle. The phone
> is still active, but no apps can be launched, they all fail to launch.
> 
> By building a new Gaia profile with NOFTU=1 I managed to get to replicate it
> >80% of the time. I've narrowed it down to an issue with the timing of the
> startup. If I had a sleep(10) at the start of the test before launching the
> app it works fine, 100% passes.
> 
> I am thinking that the AppWindowManager or something is not fully
> initialised before we try to launch the app. Has something changed recently
> that regressed the startup of APIs? It seems like we need to add another
> wait into the startup of the b2g process to wait for something to be ready.

Uh....I don't think 'freeze' is a gaia issue anyway. We cannot 'stop' any app from gaia side.
Let's have the regression window instead of just guessing. Thanks!
If it's proven to be gaia side I will take a look.
Flags: needinfo?(alive)
Thanks Alive,

We are bisecting on-device builds right now, results in a couple of hours.
Last passing (source: releng): 20140329040206
First failing (source: tbpl): 20140329090231

The Gecko range for this is:
http://hg.mozilla.org/mozilla-central/pushloghtml?fromchange=c401296f71ae&tochange=8da3aabb044f

The Gaia commits were identical between the last passing build and the first failing.
https://hg.mozilla.org/mozilla-central/rev/95357ab4ea7e
Status: NEW → RESOLVED
Closed: 6 years ago
Resolution: --- → FIXED
Realize now that this should have been marked leave open.
Status: RESOLVED → REOPENED
Resolution: FIXED → ---
(In reply to Zac C (:zac) from comment #55)
> Last passing (source: releng): 20140329040206
> First failing (source: tbpl): 20140329090231
> 
> The Gecko range for this is:
> http://hg.mozilla.org/mozilla-central/
> pushloghtml?fromchange=c401296f71ae&tochange=8da3aabb044f
> 
> The Gaia commits were identical between the last passing build and the first
> failing.

Good - that at least proves with hard evidence that this is a gecko bug.
(In reply to Jason Smith [:jsmith] from comment #58)
> (In reply to Zac C (:zac) from comment #55)
> > Last passing (source: releng): 20140329040206
> > First failing (source: tbpl): 20140329090231
> > 
> > The Gecko range for this is:
> > http://hg.mozilla.org/mozilla-central/
> > pushloghtml?fromchange=c401296f71ae&tochange=8da3aabb044f
> > 
> > The Gaia commits were identical between the last passing build and the first
> > failing.
> 
> Good - that at least proves with hard evidence that this is a gecko bug.

Note - you can go a bit a deeper with the window if you continue the bisection using mozilla-inbound builds.

https://pvtbuilds.mozilla.org/pvt/mozilla.org/b2gotoro/tinderbox-builds/mozilla-inbound-hamachi-eng/

That will get us down to only a couple of changesets.
Ah I didn't realise they were available. We'll get onto it straight away!
(In reply to comment #61)
> The first failing m-i build is:
> https://tbpl.mozilla.org/?showall=1&tree=Mozilla-Inbound&rev=574ec49a35b8

Can you please continue bisecting the b2g-inbound range there?
Yep!
First failing b2g-i is: https://tbpl.mozilla.org/?showall=1&tree=B2g-Inbound&rev=95130f72756e

Not sure where to go from here, is there a greater resolution of builds I can test?
(In reply to comment #64)
> First failing b2g-i is:
> https://tbpl.mozilla.org/?showall=1&tree=B2g-Inbound&rev=95130f72756e
> 
> Not sure where to go from here, is there a greater resolution of builds I can
> test?

This range includes commits from inbound and fx-team.  I suspect the inbound commits are more interesting.  I don't really know what the best way to use the inbound/b2g-inbound builds for bisection is but it looks like you're getting stuck in finding the regression on inbound and b2g-inbound back and forth.  One option would be building and bisecting locally, I have no further ideas.
(In reply to :Ehsan Akhgari (lagging on bugmail, needinfo? me!) from comment #65)
> (In reply to comment #64)
> > First failing b2g-i is:
> > https://tbpl.mozilla.org/?showall=1&tree=B2g-Inbound&rev=95130f72756e
> > 
> > Not sure where to go from here, is there a greater resolution of builds I can
> > test?
> 
> This range includes commits from inbound and fx-team.  I suspect the inbound
> commits are more interesting.  I don't really know what the best way to use
> the inbound/b2g-inbound builds for bisection is but it looks like you're
> getting stuck in finding the regression on inbound and b2g-inbound back and
> forth.  One option would be building and bisecting locally, I have no
> further ideas.

We shouldn't be bisecting b2g-inbound here - the m-c regression ranges points to a regression from a landing on mozilla-inbound, not b2g-inbound.
Weird. The mozilla-inbound push log indicates that this happened post merging m-c back into mozilla-inbound.

http://hg.mozilla.org/integration/mozilla-inbound/pushloghtml?fromchange=245d0cb5a7b3&tochange=574ec49a35b8
For the sake of my own sanity - can we confirm the m-c & mozilla-inbound ranges are correct with some more testing of the builds for last working & first failed?
(In reply to Jason Smith [:jsmith] from comment #68)
> For the sake of my own sanity - can we confirm the m-c & mozilla-inbound
> ranges are correct with some more testing of the builds for last working &
> first failed?

I've gone through a lot of the builds on the last-working side and refined the range down to:
http://hg.mozilla.org/mozilla-central/pushloghtml?fromchange=88ae1bfaaf3d&tochange=eb6da95b659e

I will move to m-i bisection tomorrow.
Summary: Tests are failing on Jenkins with timeout when trying to switch to the app frame → Tests are failing; app freezes after being launched
OK after comment #69 I changed my replication scenario to increase the number of repetitions to 10. Previously it was 5 and sometimes all these tests would pass giving a false positive.

After bisecting on b2g-i the range from comment #69 I have narrowed it down to one changeset:
http://hg.mozilla.org/integration/b2g-inbound/pushloghtml?fromchange=996fd5d8f606&tochange=0bc0f7044d51
which points to: https://bugzilla.mozilla.org/show_bug.cgi?id=936703

After getting this far I again went back and forward over the lw and ff and ran each again 10 times which confirmed my initial results.

Then I flashed with the latest nightly and re-ran my test scenario to be sure that the latest master is still hitting the problem and I confirmed that it is.

I reverted the Gaia commit landed in bug 936703 and reset-gaia and ran the test cases again 20 times, more than it usually takes to fail. I'm 100% sure this is the offending commit now.
Flags: needinfo?(mozbugs.retornam)
Depends on: 936703
Vicamo - Any ideas?
Flags: needinfo?(vyang)
Vicamo, I've left the steps to replicate with Askeing and Eric if you want to replicate it locally.
blocking-b2g: 2.0? → 2.0+
I have following errors as well with bug 936703 reverted.  I don't think those error messages mentioned above have anything to do with bug 936703.  I'm going to run Zac's test script locally.  Thank you.

E/GeckoConsole(   44): [JavaScript Warning: "mutating the [[Prototype]] of an object will cause your code to run very slowly; instead create the object with the correct initial [[Prototype]]
 value using Object.create" {file: "resource://gre/modules/Preferences.jsm" line: 378}]

E/GeckoConsole(   44): [JavaScript Warning: "TypeError: mutating the [[Prototype]] of an object will cause your code to run very slowly; instead create the object with the correct initial [[
Prototype]] value using Object.create" {file: "app://system.gaiamobile.org/js/browser_context_menu.js" line: 31}]

I/Gecko   (   44): *************************
I/Gecko   (   44): A coding exception was thrown and uncaught in a Task.
I/Gecko   (   44):
I/Gecko   (   44): Full message: TypeError: worker is null
I/Gecko   (   44): Full stack: post/</<@resource://gre/modules/osfile/osfile_async_front.jsm:417:11
I/Gecko   (   44): TaskImpl_run@resource://gre/modules/Task.jsm:282:1
I/Gecko   (   44): TaskImpl@resource://gre/modules/Task.jsm:247:3
I/Gecko   (   44): createAsyncFunction/asyncFunction@resource://gre/modules/Task.jsm:224:7
I/Gecko   (   44): Task_spawn@resource://gre/modules/Task.jsm:139:5
I/Gecko   (   44): post/<@resource://gre/modules/osfile/osfile_async_front.jsm:402:28
I/Gecko   (   44): Handler.prototype.process@resource://gre/modules/Promise.jsm -> resource://gre/modules/Promise-backend.js:863:11
I/Gecko   (   44): this.PromiseWalker.walkerLoop@resource://gre/modules/Promise.jsm -> resource://gre/modules/Promise-backend.js:742:7
I/Gecko   (   44):
I/Gecko   (   44): *************************

E/GeckoConsole(   44): [JavaScript Error: "A promise chain failed to handle a rejection.
E/GeckoConsole(   44):
E/GeckoConsole(   44): Date: Wed Apr 16 2014 06:02:08 GMT+0000 (UTC)
E/GeckoConsole(   44): Full Message: TypeError: worker is null" {file: "resource://gre/modules/osfile/osfile_async_front.jsm" line: 417 column: 0 source: "417"}]

E/GeckoConsole(  169): Content JS ERROR at app://homescreen.gaiamobile.org/js/configurator.js:144 in loadSVConfFileError: Failed parsing singleVariant configuration file [js/singlevariantconf.json]:  [Exception... "File error: Not found"  nsresult: "0x80520012 (NS_ERROR_FILE_NOT_FOUND)"  location: "JS frame :: app://homescreen.gaiamobile.org/js/configurator.js :: loadFile :: line 70"  data: no]

E/GeckoConsole(  169): Content JS ERROR at app://homescreen.gaiamobile.org/js/icon_retriever.js:67 in onerror: error  while HTTP GET:  http://inapp-pay-test.paas.allizom.org/img/out-64.png
E/GeckoConsole(  169): Content JS ERROR at app://homescreen.gaiamobile.org/js/icon_retriever.js:67 in onerror: error  while HTTP GET:  app://music2.gaiamobile.org/style/icons/60/Music.png
Vicamo yes I agree that the error messages we were investigating in the past are unrelated.

When the app freezes it leave no error messages at all so it's quite confusing!
In this archive I placed four log files,

  1) with bug 936703, run Gaia test_cold_launch.py in FTU,
  2) with bug 936703, run Gaia test_cold_launch.py in Homescreen,
  3) with bug 936703 REVERTED, run Gaia test_cold_launch.py in FTU,
  4) with bug 936703 REVERTED, run Gaia test_cold_launch.py in Homescreen,

It's all based on https://github.com/zacc/gaia/tree/bug_990580-bisection .  Commands executed:

  shell-1$ cd $(B2G_HOME)
  shell-1$ cd gaia; O=`git clean -d -x -n | cut -d ' ' -f 3 | cut -d/ -f 1 | sort -u | grep -v xulrunner`; [ -n "$O" ] && git clean -d -x -f $O; cd -
  shell-1$ cd gaia/tests/python/gaia-ui-tests/; python setup.py develop; cd -
  shell-1$ nice -n 19 ./build.sh && ./run-emulator.sh
  shell-2$ adb forward tcp:2828 tcp:2828
  shell-2$ (wait for either FTU or Homescreen)
  shell-2$ gaiatest --address=localhost:2828 --testvars=`pwd`/testvars.json --restart --timeout 30000 --repeat 10 gaia/tests/python/gaia-ui-tests/gaiatest/tests/unit/test_cold_launch.py

I never have a successful run in any of them.  All the test runs failed with timeout error.  Please help point out if I'm wrong in some step.
Flags: needinfo?(vyang) → needinfo?(zcampbell)
"A timeout error" ? Please be more specific.

I mean this bug does cause a timeout error so you're either hitting the failure or you've done something else wrong.
Flags: needinfo?(zcampbell) → needinfo?(vyang)
The console would be more helpful than the logcat.
Flags: needinfo?(vyang)
(In reply to Vicamo Yang [:vicamo][:vyang] from comment #79)
> Created attachment 8407404 [details]
> bug-990580_gaia-cold-launch_qemu-ics_console.txt

What device and Gaia build is this? It looks like Gaia is failing to start completely, possibly there is no profile or it's corrupted.
I think you should re-flash with the nightly build and just run the steps as you have.

Askeing can you help Vicamo to replicate this?
Flags: needinfo?(fyen)
(In reply to Zac C (:zac) from comment #80)
> (In reply to Vicamo Yang [:vicamo][:vyang] from comment #79)
> > Created attachment 8407404 [details]
> > bug-990580_gaia-cold-launch_qemu-ics_console.txt
> 
> What device and Gaia build is this? It looks like Gaia is failing to start
> completely, possibly there is no profile or it's corrupted.
> I think you should re-flash with the nightly build and just run the steps as
> you have.

It's based on emulator-ics, Gecko m-c HEAD (http://git.mozilla.org/?p=releases/gecko.git;a=commit;h=fa7c5ad357b11c1c11d71e14f28d97cc3c05ef74), Gaia https://github.com/zacc/gaia/tree/bug_990580-bisection .  I can surely boot to FTU & Homescreen.
OK that's why it is timing out - emulator is very slow to load. I don't expect you can replicate this on the emulator build.
You need to use a Hamachi/Buri.
I don't think that's a valid explanation especially when emulator is mostly idle during the process.  I'll still run the test on Nexus-5 again, which is the fastest device I have.
Ok you can try again with emulator but you need to increase the timeout on line 715 of gaia_test.py to around 120 seconds and on the command line use maybe --timeout=60000 because it is slower. Even then I can't be sure you'll replicate this issue as I've only tried on device!
We can replicate this issue on Buri. If Vicamo want to try other device, then I will try to help him to setup the environment.
Flags: needinfo?(fyen)
Thanks Askeing!
I think Buri/Hamachi is the most important because that's the device on MV and Tpe CIs where this is blocking us.
I can replicate this with a Nexus-5 on the first shot.
Attachment #8407396 - Attachment is obsolete: true
Attachment #8407404 - Attachment is obsolete: true
(In reply to Vicamo Yang [:vicamo][:vyang] from comment #87)
> Created attachment 8408162 [details]
> bug-990580_gaia-cold-launch_nexus-5_adb-logcat.zip
> 
> I can replicate this with a Nexus-5 on the first shot.

What do you mean, did the test fail with and without the patch applied?
Flags: needinfo?(vyang)
(In reply to Zac C (:zac) from comment #88)
> (In reply to Vicamo Yang [:vicamo][:vyang] from comment #87)
> > Created attachment 8408162 [details]
> > bug-990580_gaia-cold-launch_nexus-5_adb-logcat.zip
> > 
> > I can replicate this with a Nexus-5 on the first shot.
> 
> What do you mean, did the test fail with and without the patch applied?

Test pass with bug 936703 reverted; failed otherwise.  However, I have only one success (with bug 936703 reverted) and one failure so far.  Nexus-5 often fails to boot to homescreen.
Flags: needinfo?(vyang)
I think we can narrow down to |asyncStorage.getItem|.  If I partially revert what's done to that in bug 936703, then that switchToFrame timeout can't be easily replicated.
(In reply to Vicamo Yang [:vicamo][:vyang] from comment #90)
> I think we can narrow down to |asyncStorage.getItem|.  If I partially revert
> what's done to that in bug 936703, then that switchToFrame timeout can't be
> easily replicated.

Thanks Vicamo that's positive news. How do you think we should proceed; a backout or can you fix it?
It's "apps/system/js/ftu_launcher.js".  Somehow if the |asyncStorage.getItem| call in it only returns result on transaction oncomplete event, then the timeout comes back.  My steps to reproduce:

  1) add an |asyncStorage.getItemUnsafe| call that returns result in
     IDBRequest.onsuccess event.  => test PASS,
  2) let every call to |asyncStorage.getItem| call
     |asyncStorage.getItemUnsafe| instead.  Now it's practically
     identical to revert bug 936703. => test PASS,
  3) call to |asyncStorage.getItem| in "apps/system/js/ftu_launcher.js"
     => test FAIL
(In reply to Zac C (:zac) from comment #91)
> (In reply to Vicamo Yang [:vicamo][:vyang] from comment #90)
> > I think we can narrow down to |asyncStorage.getItem|.  If I partially revert
> > what's done to that in bug 936703, then that switchToFrame timeout can't be
> > easily replicated.
> 
> Thanks Vicamo that's positive news. How do you think we should proceed; a
> backout or can you fix it?

You want a backout then you get a backout, I won't stop you from that, but I guess somebody else does.  The problem is the unsafe IndexedDB usage in System App, not in those IndexedDB helpers.
Thanks for your excellent help Vicamo.

I won't force a backout, if it will only take a couple of days then we can wait to fix the unsafe usage in ftu_launcher.js. Let's file a bug for that (I'm supposed to be on holiday today!)
Depends on: 998255
https://bugzilla.mozilla.org/show_bug.cgi?id=998255 is filed for the unsafe usage in the System app.
See Also: → 991262
No longer blocks: 987760
Something in this Gecko range has solved this bug:
http://hg.mozilla.org/mozilla-central/pushloghtml?fromchange=5010b38abf18&tochange=78b0aaa91320

Just going to complete a bit more testing today and then will change the status of this bug.
Attachment #8410163 - Flags: review?(florin.strugariu)
Attachment #8410163 - Flags: review?(florin.strugariu) → review+
(In reply to Zac C (:zac) from comment #96)
> Something in this Gecko range has solved this bug:
> http://hg.mozilla.org/mozilla-central/
> pushloghtml?fromchange=5010b38abf18&tochange=78b0aaa91320
> 
> Just going to complete a bit more testing today and then will change the
> status of this bug.


This is working very reliably now. I don't think I have time to do a better analysis of how it was fixed.
Status: REOPENED → RESOLVED
Closed: 6 years ago6 years ago
Resolution: --- → WORKSFORME
You need to log in before you can comment on or make changes to this bug.