Closed Bug 909688 Opened 11 years ago Closed 11 years ago

B2G RIL: Losing Data connectivity after a few hours

Categories

(Firefox OS Graveyard :: RIL, defect)

ARM
Gonk (Firefox OS)
defect
Not set
major

Tracking

(blocking-b2g:leo+, firefox25 wontfix, firefox26 fixed, firefox27 fixed, b2g18 fixed, b2g18-v1.0.0 wontfix, b2g18-v1.0.1 wontfix, b2g-v1.1hd fixed, b2g-v1.2 fixed)

RESOLVED FIXED
1.1 QE5
blocking-b2g leo+
Tracking Status
firefox25 --- wontfix
firefox26 --- fixed
firefox27 --- fixed
b2g18 --- fixed
b2g18-v1.0.0 --- wontfix
b2g18-v1.0.1 --- wontfix
b2g-v1.1hd --- fixed
b2g-v1.2 --- fixed

People

(Reporter: julienw, Assigned: jessica)

References

Details

Attachments

(8 files, 3 obsolete files)

Note: the symptoms are different than in bug 909080, and it happens way more often.

I'm losing the data connectivity after a few hours on my Unagi on a recent 1.1 Gecko.

Symptoms are:
* the H or E icons are still there
* but when I try to visit a website in the Browser (eg: google.fr), I get an error message saying that the website can not be found (a DNS problem error page, I guess).

Here is my netcfg output in this situation:

# netcfg
lo       UP                                   127.0.0.1/8   0x00000049 00:00:00:00:00:00
rmnet0   UP                               10.10.132.217/30  0x00000041 00:00:00:00:00:00
rmnet1   DOWN                                   0.0.0.0/0   0x00000000 00:00:00:00:00:00
rmnet2   DOWN                                   0.0.0.0/0   0x00000000 00:00:00:00:00:00
rmnet3   DOWN                                   0.0.0.0/0   0x00001002 ea:dc:5c:ef:78:cf
rmnet4   DOWN                                   0.0.0.0/0   0x00001002 7a:34:da:46:5d:dd
rmnet5   DOWN                                   0.0.0.0/0   0x00001002 b6:00:bc:62:54:99
rmnet6   DOWN                                   0.0.0.0/0   0x00001002 b6:9a:7b:80:6e:3a
rmnet7   DOWN                                   0.0.0.0/0   0x00001002 fe:b7:c0:c7:1a:2f
tunl0    DOWN                                   0.0.0.0/0   0x00000080 00:00:00:00:00:00
sit0     DOWN                                   0.0.0.0/0   0x00000080 00:00:00:00:00:00
ip6tnl0  DOWN                                   0.0.0.0/0   0x00000080 00:00:00:00:00:00
wlan0    UP                                     0.0.0.0/0   0x00001003 48:28:2f:f9:b8:74

Here is my /proc/net/route :

# cat /proc/net/route                                           
Iface	Destination	Gateway 	Flags	RefCnt	Use	Metric	Mask		MTU	Window	IRTT
rmnet0	6E0AA8C0	DA840A0A	0007	0	0	0	FFFFFFFF	0	0	0

We notice especially that I don't have a default route.

Here are the output after a reboot and everything works:

# netcfg                                                        
lo       UP                                   127.0.0.1/8   0x00000049 00:00:00:00:00:00
rmnet0   UP                                 10.64.85.59/29  0x00000041 00:00:00:00:00:00
rmnet1   DOWN                                   0.0.0.0/0   0x00000000 00:00:00:00:00:00
rmnet2   DOWN                                   0.0.0.0/0   0x00000000 00:00:00:00:00:00
rmnet3   DOWN                                   0.0.0.0/0   0x00001002 36:7d:ce:a1:e5:2a
rmnet4   DOWN                                   0.0.0.0/0   0x00001002 02:17:2c:d7:0c:26
rmnet5   DOWN                                   0.0.0.0/0   0x00001002 6a:e8:6f:83:e2:57
rmnet6   DOWN                                   0.0.0.0/0   0x00001002 be:52:59:33:93:d6
rmnet7   DOWN                                   0.0.0.0/0   0x00001002 0a:2c:e1:37:5f:c8
tunl0    DOWN                                   0.0.0.0/0   0x00000080 00:00:00:00:00:00
sit0     DOWN                                   0.0.0.0/0   0x00000080 00:00:00:00:00:00
ip6tnl0  DOWN                                   0.0.0.0/0   0x00000080 00:00:00:00:00:00
wlan0    UP                                     0.0.0.0/0   0x00001003 48:28:2f:f9:b8:74
root@android:/ # cat /proc/net/route                                           
Iface	Destination	Gateway 	Flags	RefCnt	Use	Metric	Mask		MTU	Window	IRTT
rmnet0	00000000	3955400A	0003	0	0	0	00000000	0	0	0
rmnet0	3855400A	00000000	0001	0	0	0	F8FFFFFF	0	0	0
rmnet0	6E0AA8C0	3955400A	0007	0	0	0	FFFFFFFF	0	0	0


I can notice that the rmnet0 IP looks different (but that's maybe normal) but especially now I have a default route.


Vicamo, could this be related to the fix for bug 905955 ?
I don't know if this bug happens on m-c.
Flags: needinfo?(vyang)
Note: compared to bug 909080 I can send SMS and initiate a call.
leo because I see this on my Unagi (with Moz RIL) on 1.1, and this seems to be a release-blocker to me.
Severity: normal → major
blocking-b2g: --- → leo?
QA, please check if this is consistently reproducible.
Keywords: qawanted
Ken, please advise from a RIL standpoint.
Flags: needinfo?(kchang)
Note that I didn't see this again since I'm back in Paris (I was in another city when I filed this bug). I think I didn't update Gecko since then though.
QA Contact: nkot
Attached file logcat
(In reply to Preeti Raghunath(:Preeti) from comment #3)
> QA, please check if this is consistently reproducible.

still happening on Unagi 1.1 (Moz RIL)
- losing the data connectivity after a few hours
- having E icon unable to browse web
- able to browse after rebooting device
- SMS and calls are not affected

Unagi
Build ID: 20130903041201
Gecko: http://hg.mozilla.org/releases/mozilla-b2g18/rev/9da3739ce7d7
Gaia: 8dc90703f4151d6f2a0decede0ee702f425a3a38
Platform Version: 18.1
Keywords: qawanted
nkot,

Thanks for testing on Unagi. Can you please verify if this on a leo device as well?
Flags: needinfo?(nkot)
reproduces on Leo too!
observe exactly the same behavior as on Unagi

Build ID: 20130904041204
Gecko: http://hg.mozilla.org/releases/mozilla-b2g18/rev/307824edadd7
Gaia: d0a415bbf23e5d01c2b287d9fca708e167cfe70d
Platform Version: 18.1
RIL Version: 01.01.00.019.212 
(repros on both, MOZ and COM_RIL)
Flags: needinfo?(nkot)
nkot> strangely this did not happen again for me since I got back to Paris, do you have an idea of what you did to trigger this ? (I don't have a clue either)
Jessica, do you mind taking a look?
Flags: needinfo?(vyang)
Flags: needinfo?(kchang)
Hi nkot,

Since RIL log is not enabled by default, can you use MOZ RIL and follow the instructions in Bug 811605 Comment 5 to get logs again?

In addition, when this is reproduced, can you note down the output of the following three commands?
$ adb shell cat /proc/net/route
$ adb shell netcfg
$ adb shell getprop

Thank you!
Flags: needinfo?(nkot)
Attached file commands output
Jessica, 
please find the attachment with the output of those commands you've asked for

(In reply to Julien Wajsberg [:julienw] from comment #9)
> nkot> strangely this did not happen again for me since I got back to Paris,
> do you have an idea of what you did to trigger this ? (I don't have a clue
> either)

Julien,
we've been dealing with network issues in here a lot, that is the only thing I can think off, I don't do anything special that might trigger this issue...
Flags: needinfo?(nkot)
Hi nkot,

From the commands output, there is no default route, so the data connection would not work.
But I still need the logcat with RIL log enabled to see how did this happen.
Could you use MOZ RIL and follow the instructions in Bug 811605 Comment 5 to get logcat again?

BTW, do you see this issue in m-c builds?

Thank you.
Need info for comment 13
Flags: needinfo?(nkot)
Attached patch enable-log.patchSplinter Review
nkot,

It would be even better if you could just apply the attached patch, so we can check network manager logs too.

Thank you.
leo+ this is reproducible.

Jessica,

Can you please look into the bug?
blocking-b2g: leo? → leo+
Flags: needinfo?(jjong)
Hi Jessica,
attaching the logcat with the RIL log enabled. 
Unfortunately, was unable to apply a patch from comment 15, we don't do it here (can anyone build a build with that patch so I could install it?)

Some extra findings:
- it seems like the issue reproduces only when device goes in to a sleep mode (does not reproduce when screen timeout is set to never)
- does not reproduce while device is pluged in

I will test on m-r build on Monday, will also try Buri.
Flags: needinfo?(nkot)
Hi nkot,

Sorry about this, but the log in attachment 801056 [details] has only 1 minute of duration (13:50:22 ~ 13:51:14).
Could you help start recording the log from the very beginning, when device starts up?

And thank you for your extra findings. I will try to reproduce this myself with the conditions mentioned, as I usually leave my phone plugged in.

Jessica
Flags: needinfo?(jjong)
I think I might know what's happening here... I haven't been able to reproduce it yet, but I guess the root cause is similar to Bug 908603.

When network drops our data connection for some reason, RIL will receive an UNSOL_DATA_CALL_LIST_CHANGED with the data connection link down. The data connection state is updated but not removed from |currentDataCalls| [1].
When |updateRILNetworkInterface()| in |RadioInterface| [2] is called, a new data connection will be set up. This new data connection will have the same cid as the previous one, which is already in |currentDataCalls|, so it will be treated as a data call state change, therefore ip, gw and other info are not updated [1].

We have been having many issues because of this, I think we need to discuss how to handle all kinds of data call list changes properly.

nkot, please still conduct the planned tests, so that I can verify my assumptions.
Thank you.


[1] http://mxr.mozilla.org/mozilla-central/source/dom/system/gonk/ril_worker.js#3703
[2] http://mxr.mozilla.org/mozilla-central/source/dom/system/gonk/RadioInterfaceLayer.js#1475
Assignee: nobody → jjong
Needinfo for comment 19
Flags: needinfo?(nkot)
Jessica, I was trying to get a logcat starting when device starts up, but as I mentioned above, the issue does not reproduce if device is plugged in... so I am unable to get a log showing how device is losing connection.

sorry, that i can't help you much here...

I've retested on two Leos today (1-plugged, 2-unplugged)
1)Unplugged device, first check in 20 min --> lost connection, restarted device, checked browser, worked OK, left device idle, rechecked browser in 30 min --> lost connection, and so on.
2)Plugged-in device worked well for 1,5 hour, after I unplugged it and left idle for 30 min this device lost connection as well.

Have tested on m-r build, the issue does NOT reproduce. Was able to browse web using cell data every time I return, every 30 min withing two hours.

Build ID: 20130906040204
Gecko: http://hg.mozilla.org/mozilla-central/rev/ab5f29823236
Gaia: 94e5f269874b02ac0ea796b64ab995fce9efa4b3
Platform Version: 26.0a1
Flags: needinfo?(nkot)
nkot, it's okay, I was able to reproduce it leaving my phone idle and unplugged overnight (m-c build).
This happens when network drops our data connection, still not sure how it is related to the phone plugged/unplugged.
Thanks for your help. :)
This is a proposed patch to handle data call list changes.

- If network drops our data connection (either the received data call state is down or does not exist in the received data call list anymore),
we delete it from |currentDataCalls| and wait for it to reconnect when |updateRILNetworkInterface()| is called.

- If the received ip address or network interface has changed, we consider it a big change, so we deactivate data call
and wait for it to reconnect when |updateRILNetworkInterface()| is called. 
Other changes are consider minor, we just update the fields and send out a "datacallstatechange". (|RILNetworkInterface| needs to be modified to reflect these changes.)

- If we receive an unexpected data call, we should ignore it and not add it to |currentDataCalls|, otherwise it may cause other issues in the future due to the lack of apn, user, passwd and other info.


Please feel free to give comments and feedback.
Still need to find a way to test this...
Attachment #802200 - Flags: feedback?(vyang)
Just noticed bug 894831 which are some more information too.
Component: General → RIL
Summary: Losing Data connectivity after a few hours → B2G RIL: Losing Data connectivity after a few hours
Comment on attachment 802200 [details] [diff] [review]
proposed patch for handling data call list changes.

Review of attachment 802200 [details] [diff] [review]:
-----------------------------------------------------------------

Sorry for the late review.  This is an awesome catch! Thank you :)

::: dom/system/gonk/ril_worker.js
@@ +3661,5 @@
>      }
>      this.sendChromeMessage(message);
>    },
>  
> +  _compareDataCallLink: function _compareDataCallLink(source, target) {

Let's bail-out early:

  if (source.ifname != target.ifname ||
      source.ipaddr != target.ipaddr ||
      source.gw != target.gw) {
    return false;
  }

  // Compare <datacall>.dns.
  let sdns = source.dns, tdns = target.dns;
  if (sdns.length != tdns.length) {
    return false;
  }

  ...

@@ +3729,5 @@
>          currentDataCall.active = updatedDataCall.active;
>          currentDataCall.state = updatedDataCall.state;
>          currentDataCall.rilMessageType = "datacallstatechange";
>          this.sendChromeMessage(currentDataCall);
> +      } else {

Same here.  Don't use nested if-else if that's avoidable.

@@ +3753,5 @@
>      }
>  
>      for each (let newDataCall in datacalls) {
>        if (!newDataCall.ifname) {
>          continue;

Please help for refactoring this loop a little bit more:

  for each (let newDataCall in datacalls) {
    if (!newDataCall.ifname) {
      continue;
    }

    if (!newDataCallOptions) {
      if (DEBUG) ...
      continue;
    }

    ...
Attachment #802200 - Flags: feedback?(vyang) → feedback+
Addressed comments in Comment 26 and modified RadioInterfaceLayer.js to reflect data link minor changes.

Try: https://tbpl.mozilla.org/?tree=Try&rev=0a48c783e654
Attachment #802200 - Attachment is obsolete: true
Attachment #805816 - Flags: review?(vyang)
Attached file PR to external/qemu
Modified emulator not to auto attach data registration when activating data call, instead return error if data registration is unregistered. 
Otherwise, data registration will always be auto recovered after entering "gsm data unregistered" command in console when data connection is enabled.
Attachment #805818 - Flags: review?(vyang)
Comment on attachment 805816 [details] [diff] [review]
Part 1: patch for handling data call state changes properly

Review of attachment 805816 [details] [diff] [review]:
-----------------------------------------------------------------

::: dom/system/gonk/RadioInterfaceLayer.js
@@ +3246,2 @@
>      if (this.state == datacall.state) {
> +      if (datacall.state == GECKO_NETWORK_STATE_CONNECTED) {

nit: always bail-out early

@@ +3261,5 @@
> +        if (changed) {
> +          if (DEBUG) this.debug("Notify for data call minor changes.");
> +          Services.obs.notifyObservers(this,
> +                                       kNetworkInterfaceStateChangedTopic,
> +                                       null);

I think I just get another reason to move host routes management into per network interface.  When we broadcast a kNetworkInterfaceStateChangedTopic event and NetworkManager receives instantly, it adds interface host routes AGAIN and do a few things more.  Not a good idea really.  You may solve this problem by[1]:

  // Add host route for data calls
  if (network.type == Ci.nsINetworkInterface.NETWORK_TYPE_MOBILE ||
      network.type == Ci.nsINetworkInterface.NETWORK_TYPE_MOBILE_MMS ||
      network.type == Ci.nsINetworkInterface.NETWORK_TYPE_MOBILE_SUPL) {
    this.removeHostRoutes(network.name);
    this.addHostRoute(network);
  }

and implement |NetworkManager.removeHostRoutes| with |ifc_remove_host_routes|[2].

[1]: http://mxr.mozilla.org/mozilla-central/source/dom/system/gonk/NetworkManager.js#232
[2]: https://github.com/mozilla-b2g/platform_system_core/blob/master/include/netutils/ifc.h#L51
     https://github.com/mozilla-b2g/platform_system_core/blob/master/libnetutils/ifc_utils.c#L650
Attachment #805816 - Flags: review?(vyang)
Comment on attachment 805818 [details]
PR to external/qemu

Looks nice but one nit to do.  Thank you!  Will merge with that nit addressed.
Attachment #805818 - Flags: review?(vyang) → review+
(In reply to Vicamo Yang [:vicamo][:vyang] from comment #30)
> Comment on attachment 805818 [details]
> PR to external/qemu
> 
> Looks nice but one nit to do.  Thank you!  Will merge with that nit
> addressed.

Thank you.
PR has been updated with nit addressed.
This patch addresses comments in Comment 29.
Thanks for the review comments.

Try: https://tbpl.mozilla.org/?tree=Try&rev=a06ac2b0ecaa
Attachment #805816 - Attachment is obsolete: true
Attachment #806490 - Flags: review?(vyang)
Attachment #806490 - Flags: review?(vyang) → review+
Add test cases for data loss.
These test cases can only test if data call can be successfully recovered after network drop, but can not verify if the link is workable or not.

Note that these test cases depend on PR 48 to external/qemu (attachment 805818 [details]) to work.

Try: https://tbpl.mozilla.org/?tree=Try&rev=cfb5354f9a2a
Attachment #806595 - Flags: review?(vyang)
Attachment #806595 - Flags: review?(vyang) → review+
(In reply to Vicamo Yang [:vicamo][:vyang] from comment #34)
> https://hg.mozilla.org/integration/b2g-inbound/rev/c4efc4f1eb51

There are two landed patches in total.  Thank you.
https://hg.mozilla.org/integration/b2g-inbound/rev/98e7d9fb3ae7
We'll need a b2g18 patch too, it doesn't apply cleanly (although the conflicts are easy to resolve). I did one but since I haven't made the original patch I don't want to do something wrong.
https://hg.mozilla.org/mozilla-central/rev/c4efc4f1eb51
https://hg.mozilla.org/mozilla-central/rev/98e7d9fb3ae7
Status: NEW → RESOLVED
Closed: 11 years ago
Resolution: --- → FIXED
Target Milestone: --- → 1.2 FC (16sep)
Needs a branch-specific patch for b2g18 uplift.
Flags: needinfo?(jjong)
Flags: needinfo?(vyang)
Patch for b2g18.

Should I leave the patch for test cases out? there is no "test_mobile_data_connection.js" in b2g18.
Attachment #808417 - Flags: review?(vyang)
Flags: needinfo?(jjong)
Note: I think b2g18 try was quite broken at one point, not sure it was fixed.
amend commit message: remove "Part 1" as there is no Part 2.
Attachment #808417 - Attachment is obsolete: true
Attachment #808417 - Flags: review?(vyang)
Attachment #808448 - Flags: review?(vyang)
Comment on attachment 808448 [details] [diff] [review]
patch for handling data call state changes properly (b2g18)

Review of attachment 808448 [details] [diff] [review]:
-----------------------------------------------------------------

Thank you :)
Attachment #808448 - Flags: review?(vyang) → review+
does it impact commercial RIL? or commercial RIL needs to consider the same fix? thanks
Flags: needinfo?(jjong)
Based on Bug 894831, commercial RIL has already fixed this issue, but we can ask them to double confirm.
Flags: needinfo?(jjong)
Flags: needinfo?(pgravel)
Flags: needinfo?(pgravel)
You need to log in before you can comment on or make changes to this bug.

Attachment

General

Created:
Updated:
Size: