Closed Bug 906933 Opened 6 years ago Closed 6 years ago

Failing to connect/reconnect to cell data

Categories

(Firefox OS Graveyard :: General, defect, critical)

ARM
Gonk (Firefox OS)
defect
Not set
critical

Tracking

(Not tracked)

RESOLVED FIXED

People

(Reporter: stephend, Unassigned)

References

Details

(Whiteboard: [fromAutomation])

Attachments

(6 files)

Attached file logcat.txt
Gecko: http://hg.mozilla.org/mozilla-central/rev/c8c9bd74cc40
Gaia: https://github.com/mozilla-b2g/gaia/commit/f6de05c135913f2cb790759335875bb1b3c4f9bb
Device/build: Unagi, on master

Looks like bug 901844 is suspect?

Test: https://github.com/mozilla/gaia-ui-tests/blob/master/gaiatest/tests/unit/test_connect_to_network.py

gaiatest --testvars=/home/webqa/webqa-credentials/b2g/b2g-12.json --address=localhost:2828 --xml-output=results/results.xml --html-output=results/index.html --restart --repeat=0 gaiatest/tests/unit/test_connect_to_network.py
running webserver on http://10.250.1.87:39594/
TEST-START test_connect_to_network.py

test_connect_to_network (test_connect_to_network.TestConnectToNetwork) ... ERROR

======================================================================
ERROR: test_connect_to_network (test_connect_to_network.TestConnectToNetwork)
----------------------------------------------------------------------
Traceback (most recent call last):
  File "/var/jenkins/workspace/b2g.unagi.mozril.gaia.master.ui.run_adhoc_tests/.env/local/lib/python2.7/site-packages/marionette_client-0.5.36-py2.7.egg/marionette/marionette_test.py", line 132, in run
    testMethod()
  File "/var/jenkins/workspace/b2g.unagi.mozril.gaia.master.ui.run_adhoc_tests/gaiatest/tests/unit/test_connect_to_network.py", line 12, in test_connect_to_network
    self.connect_to_network()
  File "/var/jenkins/workspace/b2g.unagi.mozril.gaia.master.ui.run_adhoc_tests/gaiatest/gaia_test.py", line 534, in connect_to_network
    self.data_layer.connect_to_cell_data()
  File "/var/jenkins/workspace/b2g.unagi.mozril.gaia.master.ui.run_adhoc_tests/gaiatest/gaia_test.py", line 226, in connect_to_cell_data
    result = self.marionette.execute_async_script("return GaiaDataLayer.connectToCellData()", special_powers=True)
  File "/var/jenkins/workspace/b2g.unagi.mozril.gaia.master.ui.run_adhoc_tests/.env/local/lib/python2.7/site-packages/marionette_client-0.5.36-py2.7.egg/marionette/marionette.py", line 642, in execute_async_script
    filename=os.path.basename(frame[0]))
  File "/var/jenkins/workspace/b2g.unagi.mozril.gaia.master.ui.run_adhoc_tests/.env/local/lib/python2.7/site-packages/marionette_client-0.5.36-py2.7.egg/marionette/marionette.py", line 352, in _send_message
    self._handle_error(response)
  File "/var/jenkins/workspace/b2g.unagi.mozril.gaia.master.ui.run_adhoc_tests/.env/local/lib/python2.7/site-packages/marionette_client-0.5.36-py2.7.egg/marionette/marionette.py", line 401, in _handle_error
    raise ScriptTimeoutException(message=message, status=status, stacktrace=stacktrace)
TEST-UNEXPECTED-FAIL | test_connect_to_network.py TestConnectToNetwork.test_connect_to_network | ScriptTimeoutException: timed out
----------------------------------------------------------------------
Ran 1 test in 165.194s

I'll also attach the full logcat
Severity: normal → critical
Flags: needinfo?(frsela)
Keywords: regression
OS: Windows 7 → Gonk (Firefox OS)
Hardware: x86_64 → ARM
I don't think bug 901844 would likely be the culprit here. That bug focuses on SIM toolkit, where as this is talking cell data.
Analyzing the Jenkins test failures, the browser cell data test started failing on August 16th midday. That would imply the regression happened on August 15th.
Hsin-Yi - We're seeing continuous failures with connecting to cellular data since August 16th. I see that you've worked on the ril_worker.js recently in the changelog - http://hg.mozilla.org/mozilla-central/filelog/b7f636fada9f/dom/system/gonk/ril_worker.js. Do you have any ideas what caused this regression? Any assistance here would be greatly appreciated to help get our cell data tests back up and running.
Flags: needinfo?(htsai)
(In reply to Jason Smith [:jsmith] from comment #3)
> Hsin-Yi - We're seeing continuous failures with connecting to cellular data
> since August 16th. I see that you've worked on the ril_worker.js recently in
> the changelog -
> http://hg.mozilla.org/mozilla-central/filelog/b7f636fada9f/dom/system/gonk/
> ril_worker.js. Do you have any ideas what caused this regression? Any
> assistance here would be greatly appreciated to help get our cell data tests
> back up and running.

Hi Jason, thanks for the information. But I don't think the change in bug 905076 is the cause for cellular data failure. That bug is adding a condition check to see if the network is gsm/cdma when dialing out a voice call, not touching data connection.
Flags: needinfo?(htsai)
(In reply to Hsin-Yi Tsai [:hsinyi] from comment #4)
> (In reply to Jason Smith [:jsmith] from comment #3)
> > Hsin-Yi - We're seeing continuous failures with connecting to cellular data
> > since August 16th. I see that you've worked on the ril_worker.js recently in
> > the changelog -
> > http://hg.mozilla.org/mozilla-central/filelog/b7f636fada9f/dom/system/gonk/
> > ril_worker.js. Do you have any ideas what caused this regression? Any
> > assistance here would be greatly appreciated to help get our cell data tests
> > back up and running.
> 
> Hi Jason, thanks for the information. But I don't think the change in bug
> 905076 is the cause for cellular data failure. That bug is adding a
> condition check to see if the network is gsm/cdma when dialing out a voice
> call, not touching data connection.

Right. The regressing patch here would have landed on August 15th, so it wouldn't be caused by bug 905076. I think what we're trying to figure out here is that there was something that landed in the RIL code around August 15th that would have caused this regression.
(In reply to Jason Smith [:jsmith] from comment #5)
> (In reply to Hsin-Yi Tsai [:hsinyi] from comment #4)
> > (In reply to Jason Smith [:jsmith] from comment #3)
> > > Hsin-Yi - We're seeing continuous failures with connecting to cellular data
> > > since August 16th. I see that you've worked on the ril_worker.js recently in
> > > the changelog -
> > > http://hg.mozilla.org/mozilla-central/filelog/b7f636fada9f/dom/system/gonk/
> > > ril_worker.js. Do you have any ideas what caused this regression? Any
> > > assistance here would be greatly appreciated to help get our cell data tests
> > > back up and running.
> > 
> > Hi Jason, thanks for the information. But I don't think the change in bug
> > 905076 is the cause for cellular data failure. That bug is adding a
> > condition check to see if the network is gsm/cdma when dialing out a voice
> > call, not touching data connection.
> 
> Right. The regressing patch here would have landed on August 15th, so it
> wouldn't be caused by bug 905076. I think what we're trying to figure out
> here is that there was something that landed in the RIL code around August
> 15th that would have caused this regression.

I quickly overviewed the change log but nothing looks obviously, though bug 903257 might be a possible one. I am ccing other RIL team members, hopefully we can get more hints here. And this bug happens on automation test, not on physical machines right? Thank you.
(In reply to Hsin-Yi Tsai [:hsinyi] from comment #6)

> I quickly overviewed the change log but nothing looks obviously, though bug
> 903257 might be a possible one. I am ccing other RIL team members, hopefully
> we can get more hints here. And this bug happens on automation test, not on
> physical machines right? Thank you.

This is on real Unagis, but on automation through Python (calling async-JavaScript), right.  There might be a way to manually trigger it, but is definitely 100% through automation.
Some observation here:

From the log, seems RIL code didn't receive APN settings sending from Gaia. So it fails in setting up data call.
=============================
E/GeckoConsole(  446): Content JS LOG at dummy file:499 in GaiaDataLayer.setSetting: setting ril.data.enabled to true
I/Gecko   (  446): -*- RadioInterface[0]: 'ril.data.enabled' is now true
I/Gecko   (  446): -*- RadioInterface[0]: We haven't gotten completely the APN data. 
============================

Also, RIL is waiting for 'ril.data.apnSettings' but I didn't see 'ril.data.apnSettings' in the log. I saw only 'ril.data.apn' 'ril.data.user' and 'ril.data.passwd' but they are old format, not what RIL expects now.
Attached file adb.log
Unagi pvt m-c eng build.
  Gaia:     f6de05c135913f2cb790759335875bb1b3c4f9bb
  Gecko:    http://hg.mozilla.org/mozilla-central/rev/c8c9bd74cc40
  BuildID   20130819040203
  Version   26.0a1

Chunghwa SIM card works fine.
TaiwanMobile SIM card works fine, attach the log file.
(In reply to Hsin-Yi Tsai [:hsinyi] from comment #8)
> Some observation here:
> 
> From the log, seems RIL code didn't receive APN settings sending from Gaia.
> So it fails in setting up data call.
> =============================
> E/GeckoConsole(  446): Content JS LOG at dummy file:499 in
> GaiaDataLayer.setSetting: setting ril.data.enabled to true
> I/Gecko   (  446): -*- RadioInterface[0]: 'ril.data.enabled' is now true
> I/Gecko   (  446): -*- RadioInterface[0]: We haven't gotten completely the
> APN data. 
> ============================
> 
> Also, RIL is waiting for 'ril.data.apnSettings' but I didn't see
> 'ril.data.apnSettings' in the log. I saw only 'ril.data.apn' 'ril.data.user'
> and 'ril.data.passwd' but they are old format, not what RIL expects now.

After trace the log, GaiaDataLayer set the "ril.data.apn", "ril.data.passwd" and "ril.data.user".
Bug 850555 and Bug 837488 changed the setting architecture, it might be the possible root cause.
Hi Hsin yi,

Reproduced on 

Gaia:     f6de05c135913f2cb790759335875bb1b3c4f9bb
  B-D     2013-08-19 18:36:38
Gecko:    http://hg.mozilla.org/mozilla-central/rev/c8c9bd74cc40
BuildID   20130819040203
Version   26.0a1

Try three times on gaiatest/tests/browser/test_browser_cell_data.py, 2 passed and 1 failed and log will be attached. Please let me know any other info I can provide. 

Hi Stephen,

I found that volumn will be changed to either mute or vibration in test_browser_cell_data.py, is this correct step?
Attached file pass log
Attached file fail log
Attachment #792648 - Attachment mime type: text/x-log → text/plain
(In reply to echu from comment #11)
> Hi Stephen,
> 
> I found that volumn will be changed to either mute or vibration in
> test_browser_cell_data.py, is this correct step?

Yes, it is correct.
Gaiatest will set the volume of following channels (master, content, notification, alarm, telephony, and bt_sco) to zero.
Attached failed case summary of test_browser_cell_data.py test.

starting httpd
running webserver on http://10.247.24.84:46235/
TEST-START test_browser_cell_data.py
test_browser_cell_data (test_browser_cell_data.TestBrowserCellData) ... ERROR

======================================================================
ERROR: test_browser_cell_data (test_browser_cell_data.TestBrowserCellData)
----------------------------------------------------------------------
Traceback (most recent call last):
  File "/usr/local/lib/python2.7/dist-packages/marionette_client-0.5.36-py2.7.egg/marionette/marionette_test.py", line 123, in run
    self.setUp()
  File "/home/enpei/FirefoxOS/gaia-ui-tests/gaiatest/tests/browser/test_browser_cell_data.py", line 16, in setUp
    GaiaTestCase.setUp(self)
  File "/home/enpei/FirefoxOS/gaia-ui-tests/gaiatest/gaia_test.py", line 454, in setUp
    self.cleanUp()
  File "/home/enpei/FirefoxOS/gaia-ui-tests/gaiatest/gaia_test.py", line 503, in cleanUp
    self.data_layer.enable_wifi()
  File "/home/enpei/FirefoxOS/gaia-ui-tests/gaiatest/gaia_test.py", line 250, in enable_wifi
    result = self.marionette.execute_async_script("return GaiaDataLayer.enableWiFi()", special_powers=True)
  File "/usr/local/lib/python2.7/dist-packages/marionette_client-0.5.36-py2.7.egg/marionette/marionette.py", line 642, in execute_async_script
    filename=os.path.basename(frame[0]))
  File "/usr/local/lib/python2.7/dist-packages/marionette_client-0.5.36-py2.7.egg/marionette/marionette.py", line 352, in _send_message
    self._handle_error(response)
  File "/usr/local/lib/python2.7/dist-packages/marionette_client-0.5.36-py2.7.egg/marionette/marionette.py", line 401, in _handle_error
    raise ScriptTimeoutException(message=message, status=status, stacktrace=stacktrace)
TEST-UNEXPECTED-FAIL | test_browser_cell_data.py TestBrowserCellData.test_browser_cell_data | ScriptTimeoutException: timed out
----------------------------------------------------------------------
Ran 1 test in 73.563s

FAILED (errors=1)

SUMMARY
-------
passed: 0
failed: 1
todo: 0

FAILED TESTS
(In reply to echu from comment #15)
> Attached failed case summary of test_browser_cell_data.py test.
> 
> starting httpd
> running webserver on http://10.247.24.84:46235/
> TEST-START test_browser_cell_data.py
> test_browser_cell_data (test_browser_cell_data.TestBrowserCellData) ... ERROR
> 
> ======================================================================
> ERROR: test_browser_cell_data (test_browser_cell_data.TestBrowserCellData)
> ----------------------------------------------------------------------
> Traceback (most recent call last):
>   File
> "/usr/local/lib/python2.7/dist-packages/marionette_client-0.5.36-py2.7.egg/
> marionette/marionette_test.py", line 123, in run
>     self.setUp()
>   File
> "/home/enpei/FirefoxOS/gaia-ui-tests/gaiatest/tests/browser/
> test_browser_cell_data.py", line 16, in setUp
>     GaiaTestCase.setUp(self)
>   File "/home/enpei/FirefoxOS/gaia-ui-tests/gaiatest/gaia_test.py", line
> 454, in setUp
>     self.cleanUp()
>   File "/home/enpei/FirefoxOS/gaia-ui-tests/gaiatest/gaia_test.py", line
> 503, in cleanUp
>     self.data_layer.enable_wifi()
>   File "/home/enpei/FirefoxOS/gaia-ui-tests/gaiatest/gaia_test.py", line
> 250, in enable_wifi
>     result = self.marionette.execute_async_script("return
> GaiaDataLayer.enableWiFi()", special_powers=True)
>   File
> "/usr/local/lib/python2.7/dist-packages/marionette_client-0.5.36-py2.7.egg/
> marionette/marionette.py", line 642, in execute_async_script
>     filename=os.path.basename(frame[0]))
>   File
> "/usr/local/lib/python2.7/dist-packages/marionette_client-0.5.36-py2.7.egg/
> marionette/marionette.py", line 352, in _send_message
>     self._handle_error(response)
>   File
> "/usr/local/lib/python2.7/dist-packages/marionette_client-0.5.36-py2.7.egg/
> marionette/marionette.py", line 401, in _handle_error
>     raise ScriptTimeoutException(message=message, status=status,
> stacktrace=stacktrace)
> TEST-UNEXPECTED-FAIL | test_browser_cell_data.py
> TestBrowserCellData.test_browser_cell_data | ScriptTimeoutException: timed
> out
> ----------------------------------------------------------------------
> Ran 1 test in 73.563s
> 
> FAILED (errors=1)
> 
> SUMMARY
> -------
> passed: 0
> failed: 1
> todo: 0
> 
> FAILED TESTS

It fail at setUp() before running the test.
The setUp will try to enable Wifi, forget all networks, and then disable Wifi.
You can try to apply the patch in Bug 906595. It seems work for me.
The fix in bug 906595, which we took in https://pvtbuilds.mozilla.org/pub/mozilla.org/b2g/nightly/mozilla-central-unagi-eng/2013/08/2013-08-20-17-08-24/unagi.zip, doesn't fix the issue.

We still see it; please see http://qa-selenium.mv.mozilla.com:8080/view/B2G%20Unagi/job/b2g.unagi.mozril.gaia.master.ui.run_adhoc_tests/28/console:

ERROR: test_browser_cell_data (test_browser_cell_data.TestBrowserCellData)
----------------------------------------------------------------------
Traceback (most recent call last):
  File "/var/jenkins/workspace/b2g.unagi.mozril.gaia.master.ui.run_adhoc_tests/.env/local/lib/python2.7/site-packages/marionette_client-0.5.36-py2.7.egg/marionette/marionette_test.py", line 123, in run
    self.setUp()
  File "/var/jenkins/workspace/b2g.unagi.mozril.gaia.master.ui.run_adhoc_tests/gaiatest/tests/browser/test_browser_cell_data.py", line 17, in setUp
    self.data_layer.connect_to_cell_data()
  File "/var/jenkins/workspace/b2g.unagi.mozril.gaia.master.ui.run_adhoc_tests/gaiatest/gaia_test.py", line 226, in connect_to_cell_data
    result = self.marionette.execute_async_script("return GaiaDataLayer.connectToCellData()", special_powers=True)
  File "/var/jenkins/workspace/b2g.unagi.mozril.gaia.master.ui.run_adhoc_tests/.env/local/lib/python2.7/site-packages/marionette_client-0.5.36-py2.7.egg/marionette/marionette.py", line 642, in execute_async_script
    filename=os.path.basename(frame[0]))
  File "/var/jenkins/workspace/b2g.unagi.mozril.gaia.master.ui.run_adhoc_tests/.env/local/lib/python2.7/site-packages/marionette_client-0.5.36-py2.7.egg/marionette/marionette.py", line 352, in _send_message
    self._handle_error(response)
  File "/var/jenkins/workspace/b2g.unagi.mozril.gaia.master.ui.run_adhoc_tests/.env/local/lib/python2.7/site-packages/marionette_client-0.5.36-py2.7.egg/marionette/marionette.py", line 401, in _handle_error
    raise ScriptTimeoutException(message=message, status=status, stacktrace=stacktrace)
TEST-UNEXPECTED-FAIL | test_browser_cell_data.py TestBrowserCellData.test_browser_cell_data | ScriptTimeoutException: timed out
----------------------------------------------------------------------
Ran 1 test in 104.065s
(In reply to Stephen Donner [:stephend] from comment #19)
> Created attachment 793370 [details]
> logcat.txt for post-906933-fix

Looks the reason for the test timed out is still the same as comment 10. I see the message "We haven't gotten completely the APN data." all over the log. That said, RIL didn't get the complete APN settings to setup data call. So, even GaiaDataLayer.setSetting: setting ril.data.enabled to true, RIL is never able to setup data call. 
======
E/GeckoConsole(  447): Content JS LOG at dummy file:499 in GaiaDataLayer.setSetting: setting ril.data.enabled to true
I/Gecko   (  447): -*- RadioInterface[0]: 'ril.data.enabled' is now true
I/Gecko   (  447): -*- RadioInterface[0]: We haven't gotten completely the APN data.
E/GeckoConsole(  447): Content JS LOG at dummy file:502 in GaiaDataLayer.setSetting/req.onsuccess: setting changed
=======

I glanced over the files and directories in "gaia/tests" but I am not sure I've seen where gaia tests assign/send 'ril.data.apnSettings' to RIL.
Hi Stephen,

I also run the test_browser_cell_data on latest build, bug 906595 is not reproducible and this issue is not found for running 11 times. Based on Askeing and HsinYi's comment 10 and comment 19, can you check the setting values in testvars.json again to see if this issue can be solved?

Gaia:     70033ba9cbac65551f68fcb3a28f7daf3105e6ff                           
B-D     2013-08-21 20:03:04                                                
Gecko:    http://hg.mozilla.org/mozilla-central/rev/b2486721572e               
BuildID   20130821050136                                                     
Version   26.0a1   

Enpei
Flags: needinfo?(frsela)
This was indeed a RIL-APN settings issue, so with Al's help (thanks, Al!), we updated as such:

-    "ril.data.apn": "wap.cingular",
-    "ril.data.user": "WAP@CINGULARGPRS.COM",
-    "ril.data.passwd": "CINGULAR1"
+    "ril.data.apnSettings": [[{
+      "apn": "wap.cingular",
+      "user": "WAP@CINGULARGPRS.COM",
+      "passwd": "CINGULAR1",
+      "types": ["default","mms","supl"]
+      }]] 

Marking as FIXED - thanks to all for your help and patience in investigation and working with us on getting the "fix" (changes) in place!
Status: NEW → RESOLVED
Closed: 6 years ago
Resolution: --- → FIXED
Keywords: regression
You need to log in before you can comment on or make changes to this bug.