Closed Bug 894879 Opened 7 years ago Closed 6 years ago

SimplePush: Adaptive ping for WebSocket

Categories

(Core :: DOM: Push Notifications, defect)

x86
macOS
defect
Not set

Tracking

()

RESOLVED FIXED
mozilla33
Tracking Status
firefox31 --- wontfix
firefox32 --- wontfix
firefox33 --- fixed
b2g-v2.0 --- fixed
b2g-v2.1 --- fixed

People

(Reporter: willyaranda, Assigned: willyaranda)

References

Details

(Keywords: feature)

Attachments

(1 file, 9 obsolete files)

13.18 KB, patch
willyaranda
: review+
Details | Diff | Splinter Review
Different networks have different timeouts for sockets. And so 3G cellular networks.

We have a set-on-stone timeout on services.push.pingInterval of 30 minutes. Some networks has less timeout, so we end up with a closed connection and we need to initiate everything again (TCP socket, SSLhandshake, hello message…).

It should be better just to adapt the ping to the optimal interval of the network. Similar to the _reconnectAfterBackoff function that Nikhil did.

Assigning to Mario.
No longer blocks: simple-push-b2g
Depends on: simple-push-b2g
Component: General → DOM: Push Notifications
Product: Boot2Gecko → Core
Version: unspecified → Trunk
Attached patch WIP (obsolete) — Splinter Review
Assignee: mplmarin → willyaranda
Status: NEW → ASSIGNED
Attachment #806586 - Flags: feedback?(nsm.nikhil)
Attachment #806586 - Flags: feedback?(frsela)
Attachment #806586 - Attachment is patch: true
Attachment #806586 - Attachment mime type: text/x-patch → text/plain
Comment on attachment 806586 [details] [diff] [review]
WIP

Cool, but in my tests I forced network errors so an invalid timeout was calculated.

An open though: "Should we consider an invalid ping timeout always the connection is closed?"

I mean, if the connection is closed for any external cause (out of coverage, network error, ...) we can consider an invalid timeout when is a valid one ...

I think we should consider an invalid time if 2 or more errors happens.

Thoughs?
Attachment #806586 - Flags: feedback?(frsela) → feedback+
Comment on attachment 806586 [details] [diff] [review]
WIP

Review of attachment 806586 [details] [diff] [review]:
-----------------------------------------------------------------

right direction

::: dom/push/src/PushService.jsm
@@ +571,5 @@
>    _reconnectAfterBackoff: function() {
>      debug("reconnectAfterBackoff()");
>  
> +    //Calculate new ping interval
> +    this._calculateAdaptivePing(true);

Need to think a bit about if this is the right place to put the call, but seems ok.

@@ +589,5 @@
> +   * We need to calculate a new ping based on:
> +   *  1) Latest good ping
> +   *  3) A safe gap between 1) and the calculated new ping (1 minute)
> +   *
> +   * This is for 3G networks, whose connections' keepalives differ broadly:

put a ', for example'

@@ +606,5 @@
> +   * The value is saved on services.push.pingInterval
> +   * @param wsWasDown [Boolean] if the WebSocket was closed or it is still alive
> +   *
> +   */
> +  _calculateAdaptivePing: function(wsWasDown) {

s/wsWasDown/wsWentDown

@@ +608,5 @@
> +   *
> +   */
> +  _calculateAdaptivePing: function(wsWasDown) {
> +    debug('_calculateAdaptivePing()');
> +    if (prefs.get('adaptivePing.calculated')) {

Call this 'reachedUpperBound' rather than calculated. Then drop the check here.

If calculated is set, and the network keep alive decreases, we need to recalculate, so the wsWentDown = true case should still execute.

When the ws goes down, we know we increased the ping value too much, but we have a last known good value. If on decreasing the ping value, we come within delta of the last good, we know we should stick to this ping interval, so we set 'reachedUpperBound' to true.

You also want to unset reachedUpperBound whenever the network changes (network-active-changed observer)

@@ +618,5 @@
> +    var lastTriedPing = prefs.get('pingInterval');
> +    if (wsWasDown) {
> +      debug('The WebSocket was disconnected, trying a lower ping');
> +      //Latest ping tried is invalid, we need to lower the limit to limit / 1.5
> +      var lastGoodPing = prefs.get('adaptivePing.lastGood') || 1;

I hope you are not considering the default lastGoodPingInterval as 1ms. You'd kill the device and the server with such rapid pings :)

@@ +619,5 @@
> +    if (wsWasDown) {
> +      debug('The WebSocket was disconnected, trying a lower ping');
> +      //Latest ping tried is invalid, we need to lower the limit to limit / 1.5
> +      var lastGoodPing = prefs.get('adaptivePing.lastGood') || 1;
> +      nextPingInterval = lastTriedPing / 1.5;

s/lastTriedPing/lastTriedPingInterval?

@@ +623,5 @@
> +      nextPingInterval = lastTriedPing / 1.5;
> +
> +      // If the gap is less than the preferences gap, we are finished.
> +      if (nextPingInterval - lastTriedPing <
> +                             (prefs.get('adaptivePing.gap') || 60000)) {

It should be 'nextPingInterval - lastGoodPingInterval'. That is the condition that you want to achieve to stop calculating new ping intervals. If the new ping interval and the last known good one are within a delta, we can maintain this interval.

With the current code, and a network keep alive of 20min, a lastGoodPing of 20min, a lastTriedPing of 20min and a upper bound nextPingInterval of 30min, see what happens, we end up oscillating between 20min pings and disconnection.

Nit: I'd phrase this as

"If the new ping interval is close to the last good one, we are near optimum, so stop calculating."

@@ +630,5 @@
> +        prefs.set('adaptivePing.calculated', true);
> +      }
> +
> +    } else {
> +      debug('The WebSocket is still up');

Check reachedUpperBound here.

@@ +631,5 @@
> +      }
> +
> +    } else {
> +      debug('The WebSocket is still up');
> +      prefs.set('adaptivePing.lastGood', lastTriedPing);

just call the pref 'lastGoodPingInterval'

@@ +632,5 @@
> +
> +    } else {
> +      debug('The WebSocket is still up');
> +      prefs.set('adaptivePing.lastGood', lastTriedPing);
> +      nextPingInterval = lastTriedPing * 2;

1) Need a max bound of 30min.

Doubling it sounds drastic. Say the device is on a network which has a keep alive of 20min. It pings on the 19.99min and succeeds, so it goes to ~40min (bound to 30min). Now the next time it will lose the connection and reduce by 1.5 so coming down to 20min. Now it will again go to 30, and basically keep oscillating, while also losing the connection.

@@ +1403,5 @@
>  
>      this._waitingForPong = false;
>  
> +    //This is a pong, let's calculate next pingInterval
> +    if (message === '{}') {

The mozilla server responds with a messageType "ping" message
Attachment #806586 - Flags: feedback?(nsm.nikhil) → feedback+
Also, can you make this an 'optional' feature with a 'adaptive.enabled' pref. We may not always want this on.
Summary: [SimplePush] Adaptative ping for WebSocket → SimplePush: Adaptive ping for WebSocket
(In reply to Nikhil Marathe [:nsm] from comment #4)
> Also, can you make this an 'optional' feature with a 'adaptive.enabled'
> pref. We may not always want this on.

Why not?

The protocol (which can be improved) tries to find the longest Keep Alive for the current network. Why to fix it?

Fixing it is a very bad solution, for example, now you have 28 min. which fails with some 3G proxies and ADSL connections. Only in the 3G network prepared to it can handle this long timeout.

Also in TEF Spain, we tested we can PING each 6 hours ! so why to send each 28min? is overkilling.

So I don't see any situation in which fixing the timeout is necessary :p
(In reply to Nikhil Marathe [:nsm] from comment #3)
> 
> 1) Need a max bound of 30min.
> 
> Doubling it sounds drastic. Say the device is on a network which has a keep
> alive of 20min. It pings on the 19.99min and succeeds, so it goes to ~40min
> (bound to 30min). Now the next time it will lose the connection and reduce
> by 1.5 so coming down to 20min. Now it will again go to 30, and basically
> keep oscillating, while also losing the connection.
> 

This is a WIP and a first draft. As I talked with Guillermo yesterday, instead fix the max bound, I think an inverted parabolic function is better so increase faster with lower timeouts and reduce the increase when we're going to longer timeouts.

Fix to 30min is not ok since some networks allows HOURS so we can go to loooooonger times which is great !

Also, I talked with Guillermo that consider the max. timeout with the first error is not OK since the error cann't be produced by the max time (lost coverage, network error, ...) so almost 2 errors at the same timeout should be considered the network limit ;)

> @@ +1403,5 @@
> >  
> >      this._waitingForPong = false;
> >  
> > +    //This is a pong, let's calculate next pingInterval
> > +    if (message === '{}') {
> 
> The mozilla server responds with a messageType "ping" message

This is not the agreed message. When this message was changed?

In Madrid we decided to PING-PONG with {} in both ways to reduce payload. Why this is changed? Why not announced?

Please, don't change the protocol in unilatteral way, we need to maintain a compatible protocol ;)
(In reply to Nikhil Marathe [:nsm] from comment #3)
> @@ +608,5 @@
> > +   *
> > +   */
> > +  _calculateAdaptivePing: function(wsWasDown) {
> > +    debug('_calculateAdaptivePing()');
> > +    if (prefs.get('adaptivePing.calculated')) {
> 
> Call this 'reachedUpperBound' rather than calculated. Then drop the check
> here.
> 
> If calculated is set, and the network keep alive decreases, we need to
> recalculate, so the wsWentDown = true case should still execute.
> 
> When the ws goes down, we know we increased the ping value too much, but we
> have a last known good value. If on decreasing the ping value, we come
> within delta of the last good, we know we should stick to this ping
> interval, so we set 'reachedUpperBound' to true.
> 
> You also want to unset reachedUpperBound whenever the network changes
> (network-active-changed observer)

Our idea is to store different pings for different networks, having a DB for mcc-mnc and wifi MAC addresses too.

So we are going to calculate pings just for the current network, but, on change, we will have a saved adaptive ping on the new network. I think this is a nice addition, since me, for example, is changing constantly between wifi and 3G, both at home and at work.

Thoughts?

> 
> @@ +623,5 @@
> > +      nextPingInterval = lastTriedPing / 1.5;
> > +
> > +      // If the gap is less than the preferences gap, we are finished.
> > +      if (nextPingInterval - lastTriedPing <
> > +                             (prefs.get('adaptivePing.gap') || 60000)) {
> 
> It should be 'nextPingInterval - lastGoodPingInterval'. That is the
> condition that you want to achieve to stop calculating new ping intervals.
> If the new ping interval and the last known good one are within a delta, we
> can maintain this interval.

Fixed.


> 
> @@ +632,5 @@
> > +
> > +    } else {
> > +      debug('The WebSocket is still up');
> > +      prefs.set('adaptivePing.lastGood', lastTriedPing);
> > +      nextPingInterval = lastTriedPing * 2;
> 
> 1) Need a max bound of 30min.

Why do you think that we should cap it to 30 minutes? As Fernando said, this is really different between networks and the idea of the ping is to maximize the timing for when the WS is not closed and minimize battery usage.

We have discovered (with great surprise!) that the actual timeout of a socket in TEF Spain without the network proxy is around 6 hours, which is way better than the 28 minutes that network group gave us.

> 
> Doubling it sounds drastic. Say the device is on a network which has a keep
> alive of 20min. It pings on the 19.99min and succeeds, so it goes to ~40min
> (bound to 30min). Now the next time it will lose the connection and reduce
> by 1.5 so coming down to 20min. Now it will again go to 30, and basically
> keep oscillating, while also losing the connection.

Agree, this was a POC, and we need a better algorithm to calculate the nextPing (both up and down).

> 
> @@ +1403,5 @@
> >  
> >      this._waitingForPong = false;
> >  
> > +    //This is a pong, let's calculate next pingInterval
> > +    if (message === '{}') {
> 
> The mozilla server responds with a messageType "ping" message

Yes, I see this on the code:
https://github.com/jrconlin/pushgo/blob/e7578aa5d67616309086c9556443c6a3b302559d/src/mozilla.org/simplepush/worker.go#L624

We are sending back a plain '{}' (which in fact is shorter, and I think we agree on some bug about this). I know the client does not handle anything else than:

    let handlers = ["Hello", "Register", "Notification"];

but we should be consistent on the protocol.
(In reply to Fernando R. Sela (no CC, needinfo please) [:frsela] from comment #6)
> (In reply to Nikhil Marathe [:nsm] from comment #3)
> > 
> > 1) Need a max bound of 30min.
> > 
> > Doubling it sounds drastic. Say the device is on a network which has a keep
> > alive of 20min. It pings on the 19.99min and succeeds, so it goes to ~40min
> > (bound to 30min). Now the next time it will lose the connection and reduce
> > by 1.5 so coming down to 20min. Now it will again go to 30, and basically
> > keep oscillating, while also losing the connection.
> > 
> 
> This is a WIP and a first draft. As I talked with Guillermo yesterday,
> instead fix the max bound, I think an inverted parabolic function is better
> so increase faster with lower timeouts and reduce the increase when we're
> going to longer timeouts.
> 
> Fix to 30min is not ok since some networks allows HOURS so we can go to
> loooooonger times which is great !

Alright, but I'd like *some* upper bound, say 2 hours or so. With that I don't think the battery impact will be discernible. I've yet to see a network that has actually kept that connection alive after hours, and TCP can be finicky about reporting errors on time.

> 
> Also, I talked with Guillermo that consider the max. timeout with the first
> error is not OK since the error cann't be produced by the max time (lost
> coverage, network error, ...) so almost 2 errors at the same timeout should
> be considered the network limit ;)

Makes sense as long as the code doesn't get too complicated.

> 
> > @@ +1403,5 @@
> > >  
> > >      this._waitingForPong = false;
> > >  
> > > +    //This is a pong, let's calculate next pingInterval
> > > +    if (message === '{}') {
> > 
> > The mozilla server responds with a messageType "ping" message
> 
> This is not the agreed message. When this message was changed?

Afaik this made the server code straight-forward, but I'd suggest you and jrconlin figure this out on IRC. The current client code really doesn't care, so we won't break anything by changing one of the implementations.
(In reply to Guillermo López (:willyaranda) from comment #7)
> (In reply to Nikhil Marathe [:nsm] from comment #3)
> > @@ +608,5 @@
> > > +   *
> > > +   */
> > > +  _calculateAdaptivePing: function(wsWasDown) {
> > > +    debug('_calculateAdaptivePing()');
> > > +    if (prefs.get('adaptivePing.calculated')) {
> > 
> > Call this 'reachedUpperBound' rather than calculated. Then drop the check
> > here.
> > 
> > If calculated is set, and the network keep alive decreases, we need to
> > recalculate, so the wsWentDown = true case should still execute.
> > 
> > When the ws goes down, we know we increased the ping value too much, but we
> > have a last known good value. If on decreasing the ping value, we come
> > within delta of the last good, we know we should stick to this ping
> > interval, so we set 'reachedUpperBound' to true.
> > 
> > You also want to unset reachedUpperBound whenever the network changes
> > (network-active-changed observer)
> 
> Our idea is to store different pings for different networks, having a DB for
> mcc-mnc and wifi MAC addresses too.

Sounds like too much work for the Wi-Fi bits. Let's keep it simple and not store things we don't need to. A couple of extra pings is not a big deal.


> 
> 
> > 
> > @@ +632,5 @@
> > > +
> > > +    } else {
> > > +      debug('The WebSocket is still up');
> > > +      prefs.set('adaptivePing.lastGood', lastTriedPing);
> > > +      nextPingInterval = lastTriedPing * 2;
> > 
> > 1) Need a max bound of 30min.
> 
> Why do you think that we should cap it to 30 minutes? As Fernando said, this
> is really different between networks and the idea of the ping is to maximize
> the timing for when the WS is not closed and minimize battery usage.
> 
> We have discovered (with great surprise!) that the actual timeout of a
> socket in TEF Spain without the network proxy is around 6 hours, which is
> way better than the 28 minutes that network group gave us.

see previous comment. Are you saying that you could reproduce a case where you register for a push notification, then after sending no pings for 5-6 hours, you were able to send a instant notification to the device? What will happen if that same cell tower is under load due to a big sports event or something?

For example: I work at the Bernabeu as a security guard and the morning before the game I come in to work and my phone establishes a websocket. Over the morning it keeps increasing ping times and reaches 6 hours. Then in the evening the game starts and 80000 people come in and now the cell tower can no longer hold on to all these connections, so it drops my connection, but because of the 6 hour ping the device doesn't realise it.
(In reply to Nikhil Marathe [:nsm] from comment #8)
> (In reply to Fernando R. Sela (no CC, needinfo please) [:frsela] from
> comment #6)
> > (In reply to Nikhil Marathe [:nsm] from comment #3)
> > > 
> > > 1) Need a max bound of 30min.
> > > 
> > > Doubling it sounds drastic. Say the device is on a network which has a keep
> > > alive of 20min. It pings on the 19.99min and succeeds, so it goes to ~40min
> > > (bound to 30min). Now the next time it will lose the connection and reduce
> > > by 1.5 so coming down to 20min. Now it will again go to 30, and basically
> > > keep oscillating, while also losing the connection.
> > > 
> > 
> > This is a WIP and a first draft. As I talked with Guillermo yesterday,
> > instead fix the max bound, I think an inverted parabolic function is better
> > so increase faster with lower timeouts and reduce the increase when we're
> > going to longer timeouts.
> > 
> > Fix to 30min is not ok since some networks allows HOURS so we can go to
> > loooooonger times which is great !
> 
> Alright, but I'd like *some* upper bound, say 2 hours or so. With that I
> don't think the battery impact will be discernible. I've yet to see a
> network that has actually kept that connection alive after hours, and TCP
> can be finicky about reporting errors on time.

The thing is that we have seen that Movistar Spain has the connection opened for more than 2 hours.

09-17 20:49:52.508 I/Gecko   (  464): -*- PushService.jsm: Stopped existing alarm 15
09-17 20:49:52.758 I/Gecko   (  464): -*- PushService.jsm: Set alarm 10000 in the future 16
09-17 20:49:53.379 I/Gecko   (  464): -*- PushService.jsm: wsOnMessageAvailable() {}
09-17 20:49:53.379 I/Gecko   (  464): -*- PushService.jsm: _calculateAdaptivePing()
09-17 20:49:53.379 I/Gecko   (  464): -*- PushService.jsm: The WebSocket is still up
09-17 20:49:53.389 I/Gecko   (  464): -*- PushService.jsm: Setting the nextPing to 7680000
09-17 20:49:53.399 I/Gecko   (  464): -*- PushService.jsm: Stopped existing alarm 16
09-17 20:49:53.409 I/Gecko   (  464): -*- PushService.jsm: messageType not a string undefined
09-17 20:49:53.569 I/Gecko   (  464): -*- PushService.jsm: Set alarm 7680000 in the future 17
09-17 22:57:53.412 I/Gecko   (  464): -*- PushService.jsm: Stopped existing alarm 17
09-17 22:57:53.692 I/Gecko   (  464): -*- PushService.jsm: Set alarm 10000 in the future 18
09-17 22:57:54.273 I/Gecko   (  464): -*- PushService.jsm: wsOnMessageAvailable() {}
09-17 22:57:54.273 I/Gecko   (  464): -*- PushService.jsm: _calculateAdaptivePing()
09-17 22:57:54.273 I/Gecko   (  464): -*- PushService.jsm: The WebSocket is still up
09-17 22:57:54.283 I/Gecko   (  464): -*- PushService.jsm: Setting the nextPing to 15360000
09-17 22:57:54.293 I/Gecko   (  464): -*- PushService.jsm: Stopped existing alarm 18
09-17 22:57:54.303 I/Gecko   (  464): -*- PushService.jsm: messageType not a string undefined
09-18 04:31:05.001 I/Gecko   (  464): -*- PushService.jsm: wsOnMessageAvailable() {}
09-18 04:31:05.001 I/Gecko   (  464): -*- PushService.jsm: _calculateAdaptivePing()
09-18 04:31:05.001 I/Gecko   (  464): -*- PushService.jsm: The WebSocket is still up
09-18 04:31:05.001 I/Gecko   (  464): -*- PushService.jsm: Setting the nextPing to 30720000
09-18 04:31:05.001 I/Gecko   (  464): -*- PushService.jsm: messageType not a string undefined
09-18 04:31:05.181 I/Gecko   (  464): -*- PushService.jsm: Set alarm 10000 in the future 23
09-18 04:31:05.181 I/Gecko   (  464): -*- PushService.jsm: Stopped existing alarm 23
09-18 04:31:05.451 I/Gecko   (  464): -*- PushService.jsm: Set alarm 30720000 in the future 24

So, as you can see 15360000 still works, which is 4.27h!! Then, we had a dropped connection around 5:45h later.

I have tested that again with a gap of more of two hours, and it was still working fine, per Antonio's advise, I will put a tcpdump on the phone and see if the connection is dropped by the 2 hours of inactivity (the kernel should do that) but not sent to the application layer and then reopen again (I don't think this could be possible but…)

> > 
> > > @@ +1403,5 @@
> > > >  
> > > >      this._waitingForPong = false;
> > > >  
> > > > +    //This is a pong, let's calculate next pingInterval
> > > > +    if (message === '{}') {
> > > 
> > > The mozilla server responds with a messageType "ping" message
> > 
> > This is not the agreed message. When this message was changed?
> 
> Afaik this made the server code straight-forward, but I'd suggest you and
> jrconlin figure this out on IRC. The current client code really doesn't
> care, so we won't break anything by changing one of the implementations.

Ok, I will ping him.
(In reply to Nikhil Marathe [:nsm] from comment #9)
> > > @@ +632,5 @@
> > > > +
> > > > +    } else {
> > > > +      debug('The WebSocket is still up');
> > > > +      prefs.set('adaptivePing.lastGood', lastTriedPing);
> > > > +      nextPingInterval = lastTriedPing * 2;
> > > 
> > > 1) Need a max bound of 30min.
> > 
> > Why do you think that we should cap it to 30 minutes? As Fernando said, this
> > is really different between networks and the idea of the ping is to maximize
> > the timing for when the WS is not closed and minimize battery usage.
> > 
> > We have discovered (with great surprise!) that the actual timeout of a
> > socket in TEF Spain without the network proxy is around 6 hours, which is
> > way better than the 28 minutes that network group gave us.
> 
> see previous comment. Are you saying that you could reproduce a case where
> you register for a push notification, then after sending no pings for 5-6
> hours, you were able to send a instant notification to the device? What will
> happen if that same cell tower is under load due to a big sports event or
> something?

Yes.

> 
> For example: I work at the Bernabeu as a security guard and the morning
> before the game I come in to work and my phone establishes a websocket. Over
> the morning it keeps increasing ping times and reaches 6 hours. Then in the
> evening the game starts and 80000 people come in and now the cell tower can
> no longer hold on to all these connections, so it drops my connection, but
> because of the 6 hour ping the device doesn't realise it.

If the server drops the WS connection cleanly, you will receive a TCP FIN to close your endpoint. If something happens in the middle, you can be at most that pingInterval to receive push notifications, so yes, we could end with a 6h of no notifications. Which is really bad.

Anyway, if your cell tower is under high load, you will have a PDP context but no a real connection, because the spectrum is full of signaling and not real data :(
As for why keep it configurable - I'd love to always use adaptive, but some players may want a constant one at this point, so let's keep it configurable. Once we clear that up, we can remove the configuration option if that is decided.
koi? because we think it's a nice addition for cellular networks that does not have UDP wakeup.

Adaptive ping helps on reducing the number of pings sent to the server, so battery usage will be better and signaling on the carrier network will benefit from that.
blocking-b2g: --- → koi?
Attached patch WIP2 (obsolete) — Splinter Review
1) Do not calculate adaptive ping if we do not want to (adaptive.enabled)

2) If the module fails to connect to the push server (_retryFailCount), do not calculate a new pingInterval.

3) Try each ping at least 3 times before saying that it is not valid (I think we should lower this)

4) Do not double and half: 2x and /1.75

5) Changed the logic to handle a Ping

Comments addressed.
Attachment #806586 - Attachment is obsolete: true
Attachment #808554 - Flags: feedback?(nsm.nikhil)
Comment on attachment 808554 [details] [diff] [review]
WIP2

Review of attachment 808554 [details] [diff] [review]:
-----------------------------------------------------------------

I think decreasing the ping interval should be aggressive (divide by 2), while increasing it should be more conservative (only multiply by 1.5 or so). Any thoughts?

Also remember to reset all the values when the network changes.

::: dom/push/src/PushService.jsm
@@ +653,5 @@
> +      }
> +
> +      //Latest ping was invalid, we need to lower the limit to limit / 1.75
> +      var lastGoodPing = prefs.get('adaptive.lastGoodPingInterval') || 60000;
> +      nextPingInterval = lastTriedPingInterval / 1.75;

floor() or ceil() before doing more math.
Attachment #808554 - Flags: feedback?(nsm.nikhil) → feedback+
Attached patch Patch (obsolete) — Splinter Review
This has two different ping intervals, one for wifi and other for cellular networks.

Wifi is only 'wifi' and cellular might change for different mcc-mnc. This should not be really common, but it's there.
Attachment #808554 - Attachment is obsolete: true
Attachment #811922 - Flags: review?(nsm.nikhil)
Nikhil says we need this in 1.2.
blocking-b2g: koi? → koi+
Comment on attachment 811922 [details] [diff] [review]
Patch

Review of attachment 811922 [details] [diff] [review]:
-----------------------------------------------------------------

Please make the following change to all preferences accessess.

All prefs that require a default value, please update modules/libpref/src/init/all.js Push entries to put the default value there rather than putting '||' checks in the code.

Guillermo, can you upload a new patch with comments addressed? I will review first thing tomorrow. Let's land this before the summit!

::: dom/push/src/PushService.jsm
@@ +635,5 @@
> +    }
> +
> +    // Save actual state of the network
> +    var ns = this._getNetworkState();
> +    var network = 'wifi';

s/var/let in both places.

@@ +636,5 @@
> +
> +    // Save actual state of the network
> +    var ns = this._getNetworkState();
> +    var network = 'wifi';
> +    if (ns.mcc && ns.mnc) {

Nit: ns.ip is set to undefined. That is a better check than checking both of these. Bonus points if you modify getNetworkState()'s return value to have a key 'type' with 'wifi' or 'gsm' (there is some enumeration in the RIL stack for this, use similar literals) and use that instead. Also change other uses of this check. You might also want it to return the MAC ID of the AP (see comments below).

@@ +637,5 @@
> +    // Save actual state of the network
> +    var ns = this._getNetworkState();
> +    var network = 'wifi';
> +    if (ns.mcc && ns.mnc) {
> +      network = ns.mcc + '-' + ns.mnc;

Prefix 'mobile-' before this?

@@ +643,5 @@
> +
> +    //Restart values if we have changed the network
> +    if (prefs.get('adaptive.network') !== network) {
> +      debug('Actual network is not the same as last saved. Restarting adaptive')
> +      debug('Old network: ' + prefs.get('adaptive.network'));

Expensive debugging call. Save the value of prefs.get() in a local since it is being used twice.

@@ +650,5 @@
> +      prefs.set('adaptive.reachedUpperBound', false);
> +      this._pingRetryTimes = {};
> +
> +      //Reload old values for the other network
> +      if (network === 'wifi') {

This may be a different access point than the previous one the user was using. So wifi is not a good bet. Maybe you should not differentiate between wifi and mobile, but simply observe that the network has changed. In that case can you plug into the network-active-changed event and not save the network as a pref'

For wifi use the MAC ID of the access point as the network name or something. Could this a privacy issue? (I think not since we are only storing it on the device).

@@ +663,5 @@
> +      }
> +    }
> +
> +    var nextPingInterval;
> +    var lastGoodPing = prefs.get('adaptive.lastGoodPingInterval') || 60000;

Start at 5min at the very least. Any network that can't sustain that is a pretty useless network for doing any network activity isn't it? We'll find out in 5min anyway and re-establish the socket.

Also s/var/let in both places.

@@ +664,5 @@
> +    }
> +
> +    var nextPingInterval;
> +    var lastGoodPing = prefs.get('adaptive.lastGoodPingInterval') || 60000;
> +    var lastTriedPingInterval = prefs.get('pingInterval' || 120000);

Start at 1.5 times the lastGoodPingInterval.

s/var/let

@@ +671,5 @@
> +      debug('The WebSocket was disconnected, calculating next ping');
> +
> +      // If we have not tried this pingInterval yet, initialize
> +      if (this._pingRetryTimes[lastTriedPingInterval] === null ||
> +          this._pingRetryTimes[lastTriedPingInterval] === undefined) {

Why the two checks? It doesn't seem like null is inserted into the table at any point.

@@ +677,5 @@
> +          ' to 0');
> +        this._pingRetryTimes[lastTriedPingInterval] = 0;
> +      } else {
> +        this._pingRetryTimes[lastTriedPingInterval] =
> +                                this._pingRetryTimes[lastTriedPingInterval] + 1;

Can you collapse this whole block to:

this._pingRetryTimes[lastTriedPingInterval] = (this._pingRetryTimes[lastTriedPingInterval] || 0) + 1;

This way it will start counting from 1 rather than 0. Since you have already tried the ping interval once, it makes sense to set it to 1. It also makes reasoning easier, since this is a counter and zero-based counting is un-intuitive.

@@ +683,5 @@
> +          this._pingRetryTimes[lastTriedPingInterval]);
> +      }
> +
> +      // Try the pingInterval at least 3 times, just to be sure that the
> +      // calculated interval is not valid

Nit: Period at the end.

@@ +690,5 @@
> +          this._pingRetryTimes[lastTriedPingInterval] + ' times');
> +        return;
> +      }
> +
> +      //Latest ping was invalid, we need to lower the limit to limit / 1.75

Nit: Comment 1.75 does not match code, best to reduce it to '... we need to lower the limit.'. Please put a space after //.

@@ +710,5 @@
> +
> +    } else {
> +      debug('The WebSocket is still up');
> +      if (prefs.get('adaptive.reachedUpperBound')) {
> +        debug('Adaptive ping reached upper bound');

please print the upper bound value reached.

@@ +717,5 @@
> +      lastGoodPing = lastTriedPingInterval;
> +      nextPingInterval = Math.floor(lastTriedPingInterval * 1.5);
> +    }
> +
> +    nextPingInterval = nextPingInterval;

o_O :)

@@ +718,5 @@
> +      nextPingInterval = Math.floor(lastTriedPingInterval * 1.5);
> +    }
> +
> +    nextPingInterval = nextPingInterval;
> +    debug('Setting the nextPing to ' + nextPingInterval);

'Setting the pingInterval to '

@@ +1501,2 @@
>  
>      this._waitingForPong = false;

Move this near the ping handling code.

@@ +1515,5 @@
> +        (reply.messageType === "ping") ||
> +        (typeof reply.messageType != "string")) {
> +      debug('Pong received');
> +      this._calculateAdaptivePing(false);
> +      var doNotHandle = true;

Scope! doNotHandle is local here, will be unavailable later. Has this been tested?

@@ +1517,5 @@
> +      debug('Pong received');
> +      this._calculateAdaptivePing(false);
> +      var doNotHandle = true;
> +    }
> +

Use only the 'message === {}' test. Since the Mozilla server supports it, I see no reason to not make it the *only* way.

1.1 users are not affected by any of these changes since 1.1 does not care about the data. So we are not breaking anything.
Attachment #811922 - Flags: review?(nsm.nikhil)
I am trying to get the MAC address from the connected network but it is not possible:

>    try {
>      let wm = Cc["@mozilla.org/wifimanager;1"].getService(Ci.nsIDOMWifiManager);
>      mac = wm.connection.network.bssid;
>    } catch(e) {
>      debug("No mozWifiManager. Error: " + e);
>    }

This requires to have a window, and wifi-manage permission :(. Trying wifi worker does not get the job done since there are no communication from the parent (us) to the worker.

So I think we can live with the fact that sometimes we will be connected a bad network and have socket disconection events. Should I work more around this?

What do you think?
Flags: needinfo?(nsm.nikhil)
Clearing ni? based on Skype conversation.
Guillermo, any updates on this? It is a blocker. Should the blocking be removed for now?
Flags: needinfo?(nsm.nikhil) → needinfo?(willyaranda)
Yes, we are busy with other things, but I'm planning to do this the weekend with the discussion we had on Skype.

Sorry for the delay…
Flags: needinfo?(willyaranda)
Nikhil, why is this needed for 1.2 when it wasn't needed for 1.1? This sounds like an enhancement, it's way too late to be blocking on this for 1.2.
blocking-b2g: koi+ → koi?
Flags: needinfo?(nsm.nikhil)
Keywords: feature
ni? willyaranda, been waiting for a patch. I'm fine with dropping the blocker.
Flags: needinfo?(nsm.nikhil) → needinfo?(willyaranda)
Okay, thanks, dropping koi+.  Renom for 1.3 if you want it to block that release ... or just land before/in 28 and we're good :)
blocking-b2g: koi? → ---
Yeah, thanks for removing, will try to find free slots end this week.
Flags: needinfo?(willyaranda)
blocking-b2g: --- → 2.0?
Attached patch Patch v2 (obsolete) — Splinter Review
Attachment #811922 - Attachment is obsolete: true
Attachment #8434015 - Flags: review?(nsm.nikhil)
Nikhil, please, can you review it, we need it landed this week to be included in 2.0, thanks
Flags: needinfo?(nsm.nikhil)
Clearing the blocking flag - the blocking flag is not intended to be used for tracking features. If you want this to be included in the release, then you need to get someone to set the feature-b2g flag to 2.0.
blocking-b2g: 2.0? → ---
Comment on attachment 8434015 [details] [diff] [review]
Patch v2

Review of attachment 8434015 [details] [diff] [review]:
-----------------------------------------------------------------

Please add default preferences to the prefs files.

::: dom/push/src/PushService.jsm
@@ +427,4 @@
>    _requestTimeoutTimer: null,
>    _retryFailCount: 0,
>  
> +  _pingRetryTimes: {},

Rename to _pingIntervalRetryTimes and add a comment about its purpose, and the key value map.

@@ +603,5 @@
> +   *  1) Latest good ping
> +   *  2) A safe gap between 1) and the calculated new ping (which is
> +   *  by default, 1 minute)
> +   *
> +   * This is for 3G networks, whose connections' keepalives differ broadly,

whose connection keepalives differ broadly.

@@ +617,5 @@
> +   *
> +   * This algorithm tries to search the best value between a disconnection and a
> +   * valid ping, to ensure better battery life and network resources usage.
> +   *
> +   * The value is saved on services.push.pingInterval

s/on/in/

@@ +635,5 @@
> +        'Do not calculate a new pingInterval now');
> +      return;
> +    }
> +
> +    if (prefs.get('adaptive.reachedUpperBound')) {

Does reachedUpperBound really need to be a pref? I think it can be just a member of PushService.

@@ +666,5 @@
> +        prefs.set('pingInterval', defaultPing);
> +        prefs.set('adaptive.lastGoodPingInterval', defaultPing);
> +
> +      // Mobile network is the same, let's just update things
> +      } else {

Nit: Please move the comment into the else block.

@@ +691,5 @@
> +      this._pingRetryTimes[lastTriedPingInterval] =
> +           (this._pingRetryTimes[lastTriedPingInterval] || 0) + 1;
> +
> +       // Try the pingInterval at least 3 times, just to be sure that the
> +       // calculated interval is not valid.

Do you mean it is valid? s/not//

@@ +709,5 @@
> +        debug('We have reached the gap, we have finished the calculation');
> +        debug('nextPingInterval=' + nextPingInterval);
> +        debug('lastGoodPing=' + lastGoodPing);
> +        nextPingInterval = lastGoodPing;
> +        prefs.set('adaptive.reachedUpperBound', true);

I think you need to check for > adaptive.upperLimit since the above if statement is also true in case of the delta.

@@ +730,5 @@
> +    prefs.set('pingInterval', nextPingInterval);
> +    prefs.set('adaptive.lastGoodPingInterval', lastGoodPing);
> +
> +    //Save values for our current network
> +    if (!ns.ip) {

Since you followed the mobile first, wifi second order above, please follow the same here.

@@ -1292,5 @@
>        return;
>      }
>  
> -    // Since we've had a successful connection reset the retry fail count.
> -    this._retryFailCount = 0;

Why was this removed?

@@ +1522,5 @@
> +      debug('Pong received');
> +      this._calculateAdaptivePing(false);
> +      doNotHandle = true;
> +    }
> +

Any message received from the server should set waitingForPong to false, since that means the connection is up. So undo the change to _waitingForPong. In addition the call to setAlarm should also always happen, irrespective of what a server replies with. So the block that calculates a new ping only if the message was truly a pong should be between those two lines. I think you can then also get rid of the doNotHandle and use the original escape condition of messageType not being a string.
Attachment #8434015 - Flags: review?(nsm.nikhil) → review-
Comment on attachment 8434015 [details] [diff] [review]
Patch v2

Review of attachment 8434015 [details] [diff] [review]:
-----------------------------------------------------------------

::: dom/push/src/PushService.jsm
@@ +635,5 @@
> +        'Do not calculate a new pingInterval now');
> +      return;
> +    }
> +
> +    if (prefs.get('adaptive.reachedUpperBound')) {

I think this should be a pref, since we need to check for it between reboots. But, we can also think of reset that variable and do the calculation each time we reboot the system (I am not a fan of this).

@@ +691,5 @@
> +      this._pingRetryTimes[lastTriedPingInterval] =
> +           (this._pingRetryTimes[lastTriedPingInterval] || 0) + 1;
> +
> +       // Try the pingInterval at least 3 times, just to be sure that the
> +       // calculated interval is not valid.

Yes, "not valid", as we need to test this 3 times to say that is not working on this pingInterval.

@@ -1292,5 @@
>        return;
>      }
>  
> -    // Since we've had a successful connection reset the retry fail count.
> -    this._retryFailCount = 0;

This is moved to line 1514
Attachment #8434015 - Attachment is obsolete: true
Attachment #8435709 - Flags: review?(nsm.nikhil)
(In reply to Guillermo López :willyaranda from comment #32)
> Comment on attachment 8434015 [details] [diff] [review]
> Patch v2
> 
> Review of attachment 8434015 [details] [diff] [review]:
> -----------------------------------------------------------------
> 
> ::: dom/push/src/PushService.jsm
> @@ +635,5 @@
> > +        'Do not calculate a new pingInterval now');
> > +      return;
> > +    }
> > +
> > +    if (prefs.get('adaptive.reachedUpperBound')) {
> 
> I think this should be a pref, since we need to check for it between
> reboots. But, we can also think of reset that variable and do the
> calculation each time we reboot the system (I am not a fan of this).
> 

I sure hope users won't be rebooting their phone all the time :) I think it's fine to NOT make it a pref. If you do want it for testing purposes so you don't have to wait after flashing for ages until the new interval is calculated, please call it |adaptive.testing.reachedUpperBound| and don't set any default value for it.
Flags: needinfo?(nsm.nikhil)
Comment on attachment 8435709 [details] [diff] [review]
Calculate a adaptive ping for Push on mobile and wifi networks

Review of attachment 8435709 [details] [diff] [review]:
-----------------------------------------------------------------

r- because
- No default prefs added (modules/libpref/src/init/all.js)
- please make reachedUpperBound a member instead.
- fix review comments/answer that last bit about moving retryFailCount = 0 into the conditional.

::: dom/push/src/PushService.jsm
@@ +715,5 @@
> +        debug('We have reached the gap, we have finished the calculation');
> +        debug('nextPingInterval=' + nextPingInterval);
> +        debug('lastGoodPing=' + lastGoodPing);
> +        nextPingInterval = lastGoodPing;
> +        prefs.set('adaptive.reachedUpperBound', true);

I guess I wasn't clear earlier.
This pref will be set even if |nextPingInterval| is LESS THAN upperLimit, simply because it's ORed with the adaptive.gap check. But you only want to set it when the |nextPingInterval > upperLimit| condition is satisfied.

@@ +1518,5 @@
>  
> +    // If we are not waiting for a hello message, reset the retry fail count
> +    if (this._currentState != STATE_WAITING_FOR_HELLO) {
> +      debug('Reseting _retryFailCount and _pingIntervalRetryTimes');
> +      this._retryFailCount = 0;

I don't understand why _retryFailCount has to be moved in here. When the websocket connection is started, can't it trivially be set to 0 as it was before?
Attachment #8435709 - Flags: review?(nsm.nikhil) → review-
:willyaranda, I missed it in the review, but please use |let| instead of |var| in case you aren't already.
Attachment #8435709 - Attachment description: . Calculate a adaptive ping for Push on mobile and wifi networks → Calculate a adaptive ping for Push on mobile and wifi networks
Attachment #8435709 - Attachment is obsolete: true
Attachment #8435989 - Attachment description: . Calculate an adaptive ping for Push notifications → Calculate an adaptive ping for Push notifications
Attachment #8435989 - Flags: review?(nsm.nikhil)
Comment on attachment 8435989 [details] [diff] [review]
Calculate an adaptive ping for Push notifications

Review of attachment 8435989 [details] [diff] [review]:
-----------------------------------------------------------------

We should discuss this over IRC/skype before doing a review again. I feel there are several edge cases that should be uncovered before this is landed.

::: b2g/app/b2g.js
@@ +458,5 @@
>  // Interval at which to ping PushServer to check connection status. In
>  // milliseconds. If no reply is received within requestTimeout, the connection
>  // is considered closed.
>  pref("services.push.pingInterval", 1800000); // 30 minutes
> +pref("services.push.pingInterval.default", 1800000);// 30 min

The first pingInterval calculation isn't going to happen until 30min in the common case where the client doesn't do any register/unregister and a push isn't received (potentially because the connection is down). It would be better to start with a lower initial ping interval and have it increase.

@@ +467,5 @@
>  // enable udp wakeup support
>  pref("services.push.udp.wakeupEnabled", true);
> +// Adaptive ping
> +pref("services.push.adaptive.enabled", true);
> +pref("services.push.adaptive.mobile", 'mobile-0-0');

can be removed. prefs.get() will just return undefined and the old != new check will fail as required.

::: dom/push/src/PushService.jsm
@@ +636,5 @@
> +   *
> +   */
> +  _calculateAdaptivePing: function(wsWentDown) {
> +    debug('_calculateAdaptivePing()');
> +    if (!prefs.get('adaptive.enabled')) {

we should read this pref in init() and store as a member and check the member here to avoid reading it every time the phone reconnects or has to ping again.

@@ +655,5 @@
> +      return;
> +    }
> +
> +    // Save actual state of the network
> +    let ns = this._getNetworkState();

Bug 1018088 is going to make this async and also extract some other parameters. Do you want to add one that returns the ip/mcc/mnc since that is sync and then refactor bug 1018088 to use that and the netid.

@@ +661,5 @@
> +    if (ns.ip) {
> +      // mobile
> +      debug('mobile');
> +      let oldNetwork = prefs.get('adaptive.mobile');
> +      let newNetwork = 'mobile-' + ns.mcc + '-' + ns.mnc;

would netid be relevant here?

@@ +1525,5 @@
>        return;
>      }
>  
> +    // If we are not waiting for a hello message, reset the retry fail count
> +    if (this._currentState != STATE_WAITING_FOR_HELLO) {

I think this condition continues to be incomplete. There are other messages (register/notification) that would lead to pingIntervalRetryTimes being reset in this case, but that isn't correct since the adaptive ping calculations should be based solely on pings to be correct. Otherwise a notification immediately following the last ping will lead to the pingInterval being increased.

@@ +1532,5 @@
> +      this._pingIntervalRetryTimes = {};
> +    }
> +
> +    // Calculate a new adaptive ping, with the WS still open.
> +    this._calculateAdaptivePing(false);

should be moved inside the ping check based on the above comment.
Attachment #8435989 - Flags: review?(nsm.nikhil) → review-
Attached patch v5 (obsolete) — Splinter Review
(This needs bug 1024579 to be landed)
Attachment #8435989 - Attachment is obsolete: true
Attachment #8440651 - Flags: review?(nsm.nikhil)
Attached patch v6 (obsolete) — Splinter Review
Make the capping of a upperLimit correct
Attachment #8440651 - Attachment is obsolete: true
Attachment #8440651 - Flags: review?(nsm.nikhil)
Attachment #8440780 - Flags: review?(nsm.nikhil)
Comment on attachment 8440780 [details] [diff] [review]
v6

Review of attachment 8440780 [details] [diff] [review]:
-----------------------------------------------------------------

Canceling review, please ask for it after the variable name has been corrected and the other things I pointed out are fixed.
If you do end up making other changes please mention them in the comments.

::: b2g/app/b2g.js
@@ +467,5 @@
> +pref("services.push.pingInterval.default", 180000);// 3 min
> +pref("services.push.pingInterval.mobile", 180000); // 3 min
> +pref("services.push.pingInterval.wifi", 180000);  // 3 min
> +// How long before a DOMRequest errors as timeout
> +pref("services.push.requestTimeout", 10000);

This change not relevant to this patch.

@@ +469,5 @@
> +pref("services.push.pingInterval.wifi", 180000);  // 3 min
> +// How long before a DOMRequest errors as timeout
> +pref("services.push.requestTimeout", 10000);
> +// enable udp wakeup support
> +pref("services.push.udp.wakeupEnabled", true);

Ditto.

@@ +478,5 @@
> +pref("services.push.adaptive.lastGoodPingInterval.wifi", 180000);// 3 min
> +// Valid gap between the biggest good ping and the bad ping
> +pref("services.push.adaptive.gap", 60000); // 1 minute
> +// We limit the ping to this maximum value
> +pref("services.push.adaptive.upperLimit", 3600000); // 1 hour

We discussed capping this to 29/30 minutes.

::: dom/push/src/PushService.jsm
@@ +462,5 @@
> +   * This saves a flag about if we have reached a limit for adaptive ping.
> +   * If this is true, the adaptive ping is not calculated because we have a
> +   * small gap between a good known ping and a bad known ping.
> +   */
> +  _reachedUpperBound: false,

In the previous patches, _reachedUpperBound was clearly used for when the calculated ping value had reached a maximum interval between consecutive pings (something like 30 minutes). Why has this variable been repurposed simply to say "don't recalculate ping" even in the case of the gap between the new ping and the old ping being too small being the reason for this. Either rename the variable or restore it to it's original purpose please.
Attachment #8440780 - Flags: review?(nsm.nikhil)
Attached patch v7 (obsolete) — Splinter Review
Changes:
 - Renamed _reachedUpperBound to _recalculatePing to hold if we need to calculate a new ping in next calls.
 - Removed the two extra prefs on b2g.js
Attachment #8440780 - Attachment is obsolete: true
Attachment #8446443 - Flags: review?(nsm.nikhil)
Comment on attachment 8446443 [details] [diff] [review]
v7

Review of attachment 8446443 [details] [diff] [review]:
-----------------------------------------------------------------

::: dom/push/src/PushService.jsm
@@ +507,5 @@
>      this._db = new PushDB();
>  
> +    this._adaptiveEnabled = prefs.get('adaptive.enabled');
> +    this._upperLimit = prefs.get('adaptive.upperLimit');
> +

Stylistic nit: Please move these 2 lines down to after _requestTimeout is read.

@@ +631,5 @@
>     */
>    _reconnectAfterBackoff: function() {
>      debug("reconnectAfterBackoff()");
> +    //Calculate new ping interval
> +    this._calculateAdaptivePing(true);

Nit: |true /* wsWentDown */| would make it easier to read.
Attachment #8446443 - Flags: review?(nsm.nikhil) → review+
I don't really see why qawanted was added here, so clearing the flag.
Keywords: qawanted
Attached patch Final patchSplinter Review
Carrying a r+ from nsm from previous comment
Attachment #8446443 - Attachment is obsolete: true
Attachment #8452949 - Flags: review+
hi,

could you provide a Try link. Suggestions for what to run if you haven't
yet can be found here:
https://wiki.mozilla.org/Sheriffing/How:To:Recommended_Try_Practices
Keywords: checkin-needed
Hi Tomcat,

please refer to https://bugzilla.mozilla.org/show_bug.cgi?id=1024579#c5 . There is no tests in push code now, so try will make no difference :(
Flags: needinfo?(cbook)
Flags: needinfo?(cbook)
https://hg.mozilla.org/mozilla-central/rev/ab6938cf518f
Status: ASSIGNED → RESOLVED
Closed: 6 years ago
Resolution: --- → FIXED
Target Milestone: --- → mozilla33
Comment on attachment 8452949 [details] [diff] [review]
Final patch

Approval Request Comment
[Feature/regressing bug #]: Adaptive ping
[User impact if declined]: Things working as they are now but unoptimized usage of battery and network resources in LatAm (one variant for all countries, with really different networks, pings differ broadly, we need to configure the smallest ping, instead of learning from the network).
[Describe test coverage new/current, TBPL]: None.
[Risks and why]: 
[String/UUID change made/needed]: None

This patch need also https://bugzilla.mozilla.org/show_bug.cgi?id=1024579
Attachment #8452949 - Flags: approval-mozilla-aurora?
Attachment #8452949 - Flags: approval-mozilla-aurora? → approval-mozilla-aurora+
Comment on attachment 8452949 [details] [diff] [review]
Final patch

Doug - This is a late breaking feature & doesn't have a risk assessment on the bug here. This needs to be ran through product to determine whether this should or shouldn't be taken.

We're past 6/20 anyways though, so it's likely that this won't be approved.
Attachment #8452949 - Flags: approval-mozilla-aurora+ → approval-mozilla-aurora?
The risk here is that if there is any bug in the code, we might end up reaching a 30 minutes ping (because it's capped to this value), which will be the same as previous code (so, the behaviour will be the same)

OR

we might end up in a small ping (3 minutes, as this is the minimum value, and in some countries like Brazil, this will be more or less the default behaviour (5 minutes of keepalive)).
No longer blocks: 1038078
We also tested it in several scenarios and we did not see any issue, anyway we added a follow up in bug 1038078 to include integration tests
(In reply to Guillermo López :willyaranda from comment #50)
> Comment on attachment 8452949 [details] [diff] [review]
> Final patch
> 
> Approval Request Comment
> [Feature/regressing bug #]: Adaptive ping
Is adaptive ping a new feature? If it is an existing feature, when was it introduced?

> [User impact if declined]: Things working as they are now but unoptimized
> usage of battery and network resources in LatAm (one variant for all
> countries, with really different networks, pings differ broadly, we need to
> configure the smallest ping, instead of learning from the network).
Can you explain what 'unoptimized' means in this case? What battery and network savings are seen with this patch?

> [Describe test coverage new/current, TBPL]: None.
This feature needs tests. I think bug 1038078 needs to be ready to land with this feature. Can you also comment on how and what you tested as per comment 53?

> [Risks and why]: 
Looks like risk is documented in comment 52.

> [String/UUID change made/needed]: None
> 
> This patch need also https://bugzilla.mozilla.org/show_bug.cgi?id=1024579

As Jason said, we're past the feature deadline for 2.0. Given the state of the 2.0 tree, we really need to limit the changes that we're taking. I think we'll need a strong performance case to consider this feature.
Flags: needinfo?(willyaranda)
Let me explain more in detail, to handle push notifications from push server, the device has a WebSocket permanently opened with the server. To avoid it is closed by the network, the device has to send a ping periodically and the interval is configured in the build. This interval is highly dependent on the mobile network the device is connected to and also it is different if it is connected via WiFi. So, having a short ping interval means a lot of power consumption in the device and also high network activity. 

To give you a real work example, we use same build for several LatAm countries, to make logistic easier and save money. In one of the countries the ping interval has to be 5 minutes, otherwise connection would be closed by the network. For remaining countries it could be higher, 30 minutes or more, but we cannot take advantage of it, because the build is the same, so we have to configure the most restrictive ping interval.

What this patch does is to have a dynamic ping interval, learning from the network which one is the best one in each moment. It is a low risk patch, and not taking it means making user experience worse: having to recharge battery very often, and also expending more money because of frequent network usage.
Just one more thing to add, there is preference to disable this patch during build time
(In reply to Lawrence Mandel [:lmandel] from comment #54)
> As Jason said, we're past the feature deadline for 2.0. Given the state of
> the 2.0 tree, we really need to limit the changes that we're taking. I think
> we'll need a strong performance case to consider this feature.

The main reason of landing this is battery usage and network resources in LatAm countries.

TEF has just one variant for LatAm, that defines a single value for keepalives (or ping messages) for the whole continent.

This means that pings will be the same in Brazil and in Colombia. The problem here is that Brazil has a socket timeout of 5 minutes, and 28 in Colombia. So we need to put 4 minutes as a ping, and that will be used in Brazil (it's OK), and in Colombia (it is really BAD).

We would like to have this landed on 2.0, so we can enhance our users' experience with Firefox OS.
Flags: needinfo?(willyaranda) → needinfo?(lmandel)
I agree that adaptive ping seems highly desirable but that doesn't change where we are in the schedule. I would prefer to push this new feature work out to 2.1. What is the state of the devices that have been released with earlier builds? Have you had a lot of user complaints about battery and network usage?
Flags: needinfo?(lmandel) → needinfo?(willyaranda)
We do not have statistics about this issue compared to other issues, but it something we tested in lab, and we could see that ping interval has a great impact in both battery and network usage.

What about have it disabled by default? This way it does not change current behavior by default and we can enable it in our builds
Flags: needinfo?(willyaranda)
Comment on attachment 8452949 [details] [diff] [review]
Final patch

Discussed offline, Lucas made the point that the risk of uplifting bug 1026599, bug 894879, and bug 1024579 is much lower than not uplifting, especially given that this will end up in the vendor build anyway. Given the options, it is preferable that we maintain visibility into this change and the impact that it has on 2.0. I am approving the remaining two bugs (bug 1024579 has already been uplifted). Although preffed off, we need to monitor these landings for any fallout. b2g32+
Attachment #8452949 - Flags: approval-mozilla-aurora? → approval-mozilla-b2g32+
Lawence, the patch enables adaptive ping by default. Should we open another bug to make this disabled by default?
Flags: needinfo?(lmandel)
(In reply to Guillermo López :willyaranda from comment #61)
> Lawence, the patch enables adaptive ping by default. Should we open another
> bug to make this disabled by default?

Yes. Please flag me for review and let's land both at the same time.
Flags: needinfo?(lmandel)
> Yes. Please flag me for <strike>review</strike> approval and let's land both at the same time.
See Also: → 1152264
You need to log in before you can comment on or make changes to this bug.