Closed Bug 863732 Opened 7 years ago Closed 7 years ago

SimplePush: Use RTC wakeup to monitor WebSocket connection.

Categories

(Firefox OS Graveyard :: General, defect)

x86_64
Linux
defect
Not set

Tracking

(firefox22 wontfix, firefox23 wontfix, firefox24 fixed, b2g18+ fixed, b2g18-v1.0.0 wontfix, b2g18-v1.0.1 wontfix)

RESOLVED FIXED
Tracking Status
firefox22 --- wontfix
firefox23 --- wontfix
firefox24 --- fixed
b2g18 + fixed
b2g18-v1.0.0 --- wontfix
b2g18-v1.0.1 --- wontfix

People

(Reporter: nsm, Assigned: nsm)

References

Details

(Whiteboard: [fixed-in-birch])

Attachments

(2 files, 5 obsolete files)

As discussed at the Madrid work week, the RTC wakeup is the only way to detect that the connection is alive when the phone is in sleep mode. This patch adds application level ping pong (empty JSON object) with the alarms patch from bug 862322 and PushService-is-a-module from 863598. Wakeup interval is about 5 minutes.
Must be applied over patches for the bugs this depends on.
Attachment #739613 - Flags: review?(justin.lebar+bug)
We have a ping message which is "PING" instead of "{}". I like this approach, so we will change our PING message to {}.

But you are not expecting any response from the server, right? (we are sending a "PONG")
@willyaranda: Expecting 'any data', but ideally you'd send '{}' as its half the size and valid JSON :)
of course, if you want to be smart on the server (I don't believe it to be necessary), then if you have messages to be delivered anyway, then you could just not write the pong, but send those messages instead. Basically the client code just wants to receive some data from the server to know that the connection is alive and working.
(In reply to Nikhil Marathe from comment #3)
> @willyaranda: Expecting 'any data', but ideally you'd send '{}' as its half
> the size and valid JSON :)
> of course, if you want to be smart on the server (I don't believe it to be
> necessary), then if you have messages to be delivered anyway, then you could
> just not write the pong, but send those messages instead. Basically the
> client code just wants to receive some data from the server to know that the
> connection is alive and working.

agree. To optimize the BW and network resources, we should reuse the PING-PONG to send pending messages.

@willyaranda: This is the same I drafted 4 months ago (Issue #122 & PR #137). I'll retake it ;)
> To optimize the BW and network resources, we should reuse the PING-PONG to send pending messages.

I don't think we should do this, for two reasons.

1. We can't do SSL over UDP, so there's no way to authenticate the data we'd receive.  We insist that the websocket be over SSL, and I think there's a good reason for that.

2. If we sent data over UDP, we'd have to fit it into a single packet.  Spec'ing a minimum network MTU seems pretty undesirable to me; it would place constraints on what network operators could do inside their private networks.
(In reply to Justin Lebar [:jlebar] from comment #5)
> > To optimize the BW and network resources, we should reuse the PING-PONG to send pending messages.
> 
> I don't think we should do this, for two reasons.
> 
> 1. We can't do SSL over UDP, so there's no way to authenticate the data we'd
> receive.  We insist that the websocket be over SSL, and I think there's a
> good reason for that.
> 
> 2. If we sent data over UDP, we'd have to fit it into a single packet. 
> Spec'ing a minimum network MTU seems pretty undesirable to me; it would
> place constraints on what network operators could do inside their private
> networks.

UDP isn't being used. Ping-Pong is only done over Websocket, and only when UDP is not in use.
(In reply to Justin Lebar [:jlebar] from comment #5)
> > To optimize the BW and network resources, we should reuse the PING-PONG to send pending messages.
> 
> I don't think we should do this, for two reasons.
> 
> 1. We can't do SSL over UDP, so there's no way to authenticate the data we'd
> receive.  We insist that the websocket be over SSL, and I think there's a
> good reason for that.
> 

Probably some misunderstanding :)

Over UDP NO DATA is sent, it's only used to wakeup the device. Here we're talking about Websocket connection where the client network hasn't a WakeUp server so a permanent Websocket connection should be maintained open.

To reach this, we need to PING-PONG (keepalive) every X minutes to avoid intermediate servers closes the connection.


> 2. If we sent data over UDP, we'd have to fit it into a single packet. 
> Spec'ing a minimum network MTU seems pretty undesirable to me; it would
> place constraints on what network operators could do inside their private
> networks.


Same as before, UDP is not used to sent ANYTHING. probably a misunderstanding ;)
> UDP isn't being used. Ping-Pong is only done over Websocket, and only when UDP is not in use.

I see; sorry.  I guess I should have read the patch first.  :)

> if you want to be smart on the server (I don't believe it to be necessary), then if you have 
> messages to be delivered anyway, then you could just not write the pong, but send those messages 
> instead. 

Now that I understand what you mean, this sounds great.

If I'm understanding the patch correctly, we have to wake up and turn on the radio every five minutes to send a ping?  Is that what prior art does?  At the very least, we need to do tests to see how this affects battery life.  But I think you should get mvines's sign-off as well.
(In reply to Justin Lebar [:jlebar] from comment #8)
> > UDP isn't being used. Ping-Pong is only done over Websocket, and only when UDP is not in use.
> 
> I see; sorry.  I guess I should have read the patch first.  :)
> 

Yeap ! ;)

> > if you want to be smart on the server (I don't believe it to be necessary), then if you have 
> > messages to be delivered anyway, then you could just not write the pong, but send those messages 
> > instead. 
> 
> Now that I understand what you mean, this sounds great.
> 

:)

> If I'm understanding the patch correctly, we have to wake up and turn on the
> radio every five minutes to send a ping?  Is that what prior art does?  At
> the very least, we need to do tests to see how this affects battery life. 
> But I think you should get mvines's sign-off as well.

5 minutes is crazy, other PUSH solutions ping each 28 minutes (GCM for example). Adaptative protocol is the best way to find the longest Timeout without lost the connection like APNS.
Comment on attachment 739613 [details] [diff] [review]
Use RTC wakeup to monitor WebSocket connection.

>diff --git a/dom/push/src/PushService.jsm b/dom/push/src/PushService.jsm
>--- a/dom/push/src/PushService.jsm
>+++ b/dom/push/src/PushService.jsm

>+      case "alarm-fired":
>+        debug("ALARM FIRED! " + aData);
>+        if (JSON.parse(aData).id != this._alarmID) {
>+            debug("Not our alarm");
>+            return;
>+        }

I think this is a bad API.  It means we have to wake up every observer just so
they can JSON.parse(aData) and check the ID.  But I'll comment in the other
bug.

>+        // This check for this._ws is required. If we set an alarm, and before
>+        // it fires, the websocket is closed, the currently active alarm is not
>+        // stopped. When it fires, this._ws will be null.
>+        if (!this._ws || this._currentState != STATE_READY) {
>+          debug("WebSocket is not active");
>+          return;
>+        }
>+
>+        if (this._pingsWithNoReply >= prefs.get("maxConsecutivePingAttempts")) {
>+          debug("Too many pings with no replies received." +
>+                "Reconnecting WebSocket.");
>+          this._shutdownWS();
>+          this._reconnect();
>+          return;
>+        }
>+
>+        this._setAlarm();
>+        // bypass the queue, we don't want this to
>+        // be kept pending.

Nits: Start sentences with capital letters, please.  And s/,/;/ to fix the
comma splice.  Also this doesn't need to be wrapped to a new line.

>+        this._ws.sendMsg('{}');
>+
>+        // Initial value is 1 because the check is performed before the
>+        // increment. So when the alarm that was set right now fires, there
>+        // will be one ping with no reply. On the other hand, if a reply was
>+        // received, onMessageAvailable() will reset _pingsWithNoReply to 0.
>+
>+        this._pingsWithNoReply = (this._pingsWithNoReply == undefined) ?
>+                                    1 : this._pingsWithNoReply + 1;

I think you want ===?  Also please move the 1 right under the "(".

>   // aStatusCode is an NS error from Components.results
>-  _socketError: function(aStatusCode) {
>-    debug("socketError()");
>+  _reconnect: function() {

Maybe "reconnectAfterBackoff" would be a better name for this; I was confused
as to why we were reconnecting immediately, but that's not what we do.

>@@ -550,21 +599,37 @@ this.PushService = {
>     else {
>       debug("Unsupported websocket scheme " + uri.scheme);
>       return;
>     }
> 
>     debug("serverURL: " + uri.spec);
>     this._wsListener = new PushWebSocketListener(this);
>     this._ws.protocol = "push-notification";
>-    this._ws.pingInterval = prefs.get("websocketPingInterval");

Ah, I see that we used to ping once every minute, but now we ping every 5m.
That's an improvement!

I'd still like to get mvine's opinion on this, though.

>+  _setAlarm: function() {
>+    AlarmService.add({
>+      date: new Date(Date.now()+prefs.get("pingInterval")),

Spaces around operators, please.

>@@ -1156,27 +1224,31 @@ this.PushService = {
>    * connection close status code.
>    *
>    * If we do not explicitly call ws.close() then statusCode is always
>    * NS_BASE_STREAM_CLOSED, even on a successful close.
>    */
>   _wsOnStop: function(context, statusCode) {
>     debug("wsOnStop()");
> 
>+    this._stopAlarm();
>+    this._shutdownWS();

shutdownWS calls stopAlarm.

This code is fine, but it sounds like 5m is the wrong thing to do.  If you want
to address that in a follow-up, that's fine with me, since this code makes
things better than they were.
Attachment #739613 - Flags: review?(justin.lebar+bug) → review+
FYI, I'm going to be really busy this week with some blockers, so I'm sorry in advance if I flake on reviews and so on.
Fixed up based on review.
Attachment #739613 - Attachment is obsolete: true
Attachment #741148 - Flags: review?(justin.lebar+bug)
Attachment #741148 - Flags: feedback?(mvines)
I r+'ed the last patch, so I don't need to look at this new patch to make sure that you made the changes I asked for.

Is there anything in particular you'd like me to look at here?
Comment on attachment 741148 [details] [diff] [review]
Use RTC wakeup to monitor WebSocket connection.

Carrying forward earlier r+ by jlebar.
Attachment #741148 - Flags: review?(justin.lebar+bug) → review+
Michael,

I'd really appreciate some feedback on a good interval. GCM has a 28min ping interval (i.e. it sends a keep alive packet every 28min), and reconnects if the server doesn't reply.
Flags: needinfo?(mvines)
The less frequent the interval the better, clearly.  Drain caused by a quick ~100ms wakeup every 30min-ish seems likely to go unnoticed by the user
Flags: needinfo?(mvines)
This patch is fundamentally flawed for longer push intervals. What it does is set a timer and wait around for a few failed pings. With a long ping interval, this will wait around for multiples of that to pronounce failure.

Working up a new patch with two timers. One timer is the ~30 minute timer that controls 'let's send a ping now'. That should start another timer after sending a ping, a shorter one ~15-20 sec that checks if the server responded within that time, otherwise tries to reconnect.

Patch coming up.
Attachment #741148 - Attachment is obsolete: true
Attachment #741148 - Flags: feedback?(mvines)
Fixes as per comment 17.

Switched retryTimeoutTimer (used for exponential back off reconnection) to alarms as well.
Attachment #744491 - Flags: review?(justin.lebar+bug)
>+// Interval at which to ping PushServer to check connection status. In
>+// milliseconds. If no reply is received within requestTimeout, connection is
>+// considered closed.

s/connection/the connection/

>+pref("services.push.pingInterval", 1800000); // 30 minutes

>diff --git a/dom/push/src/PushService.jsm b/dom/push/src/PushService.jsm
>--- a/dom/push/src/PushService.jsm
>+++ b/dom/push/src/PushService.jsm

Call me weak, but three is apparently more timers than I can follow.  We have

 - Retry timer
   Delay: Between retryBaseInterval (5s) and maxRetryInterval (20m)
   Started in: reconnectAfterBackoff, which is called in onPongTimeout and
      _wsOnStop.
   When fired: _beginWSSetup

 - pingIntervalAlarm
   Delay: pingInterval (30m)
   Started in: _onPongTimeout if no pong is pending, and after finishing the WS
     handshake.
   When fired: Send the server an empty message, and wait for a reply (using
     setPongTimeoutAlarm).

 - pongTimeoutAlarm
   Delay: requestTimeout (10s)
   Started: After pinging server in _onPingIntervalTimeout.
   When fired: Either set ping interval alarm, if we got a pong in time, or
     shut down the WS and reconnect after a backoff.

This is confusing to me because all three timers trigger each other, after
varying delays and under various circumstances, and the control flow is
scattered across the code, so you have to have global knowledge to see how
anything works.

Can we try to untangle this spaghetti?

At any given time, we need only one alarm pending, I think.  And we should be
in one of a few states:

 * connected, waiting to send ping
 * connected, waiting for pong
 * disconnected, waiting to try reconnecting
 * disconnected, currently handshaking

There's no need to send a ping if we've heard from the server in the past 30m,
I think.  So every time we receive a message we can reset our alarm to 30m in
the future.  That's good for battery life.

Somehow it seems like the maximum reconnect backoff (the 20m) should equal the
ping delay (30m).  I don't have a good reason for this, though; it just seems
elegant.  So feel free to ignore this.  :)

Do you think you could try making this patch more straightforward with all this
in mind?

>@@ -388,30 +388,30 @@ this.PushService = {
>   /**
>    * How retries work:  The goal is to ensure websocket is always up on
>    * networks not supporting UDP. So the websocket should only be shutdown if
>    * onServerClose indicates UDP wakeup.  If WS is closed due to socket error,
>-   * _socketError() is called.  The retry timer is started and when it times
>+   * _reconnectAfterBackoff() is called.  The retry timer is started and when it times
>    * out, beginWSSetup() is called again.

Line length.  :)

>-  _retryTimeoutTimer: null,
>+  _retryTimeoutAlarmID: null,

>@@ -492,50 +495,54 @@ this.PushService = {
>     // At this point, profile-change-net-teardown has already fired, so the
>     // WebSocket has been closed with NS_ERROR_ABORT (if it was up) and will
>     // try to reconnect. Stop the timer.
>-    if (this._retryTimeoutTimer)
>-      this._retryTimeoutTimer.cancel();
>+    if (this._retryTimeoutAlarmID) {

Do you know that 0 is an invalid AlarmID?  I suspect it's not invalid, in which
case this should check === null.

>+      AlarmService.remove(this._retryTimeoutAlarmID);
>+      this._retryTimeoutAlarmID = 0;

Then null here, too.

>   // aStatusCode is an NS error from Components.results
>-  _socketError: function(aStatusCode) {
>-    debug("socketError()");
>+  _reconnectAfterBackoff: function() {
>+    debug("reconnect()");

             reconnectAfterBackoff()?

>     // Calculate new timeout, but cap it to

Cap it to what?

>@@ -567,21 +574,92 @@ this.PushService = {
>     else {
>       debug("Unsupported websocket scheme " + uri.scheme);
>       return;
>     }
> 
>     debug("serverURL: " + uri.spec);
>     this._wsListener = new PushWebSocketListener(this);
>     this._ws.protocol = "push-notification";
>-    this._ws.pingInterval = prefs.get("websocketPingInterval");
>     this._ws.asyncOpen(uri, serverURL, this._wsListener, null);
>     this._currentState = STATE_WAITING_FOR_WS_START;
>   },
> 
>+  _stopPingIntervalAlarm: function() {
>+    if (this._pingIntervalAlarmID) {

=== null

>+      AlarmService.remove(this._pingIntervalAlarmID);
>+      this._pingIntervalAlarmID = 0;

null

>+    }
>+  },
>+
>+  _stopPongTimeoutAlarm: function() {
>+    if (this._pongAlarmID) {

ibid

>+      AlarmService.remove(this._pongAlarmID);
>+      this._pongAlarmID = 0;

ibid

>@@ -1124,18 +1204,20 @@ this.PushService = {
>   _wsOnStart: function(context) {
>     debug("wsOnStart()");
>     if (this._currentState != STATE_WAITING_FOR_WS_START) {
>       debug("NOT in STATE_WAITING_FOR_WS_START. Current state " +
>             this._currentState + ". Skipping");
>       return;
>     }
> 
>-    if (this._retryTimeoutTimer)
>-      this._retryTimeoutTimer.cancel();
>+    if (this._retryTimeoutAlarmID) {

ibid

>+      AlarmService.remove(this._retryTimeoutAlarmID);
>+      this._retryTimeoutAlarmID = 0;

ibid
Attachment #744491 - Flags: review?(justin.lebar+bug)
Changed 'invalid alarm ID' to null.

All alarms unified as setAlarm(), stopAlarm() and onAlarmFired() with the various drivers of the system commented.
Attachment #744491 - Attachment is obsolete: true
Attachment #745092 - Flags: review?(justin.lebar+bug)
Comment on attachment 745092 [details] [diff] [review]
Use RTC wakeup to monitor WebSocket connection.

This is a lot easier for me to understand.  Thanks!

Just some obligatory nits.

>+  _reconnectAfterBackoff: function() {
>+    debug("reconnectAfterBackoff()");
> 
>-    // Calculate new timeout, but cap it to
>+    // Calculate new timeout, but cap it to maxRetryInterval.

> pingInterval

>+  /**
>+   * There is only one alarm active at any time. This alarm has 3 intervals
>+   * corresponding to 3 tasks.
>+   *
>+   * 1) Pong timeout.
>+   *    If there is no pong received, the connection is closed and PushService
>+   *    tries to reconnect, repurposing the alarm for (3).

If we haven't received any messages from the server by the time this alarm fires,

>+   * 2) Send a ping.
>+   *    The protocol sends a ping ({}) on the wire every pingInterval ms. Once
>+   *    it sends the ping, the alarm goes to task (1) which is waiting for
>+   *    a pong. If data is received after the ping is sent,
>+   *    _wsOnMessageAvailable() will reset the ping alarm (which cancels
>+   *    waiting for the pong). So as long as the connection is fine, pong alarm
>+   *    never fires.
>+   *
>+   * 3) Reconnect after backoff.
>+   *    The alarm is set by _reconnectAfterBackoff() and increases in duration.

increases in duration every time we try and fail to connect.

>+   *    When it triggers, websocket setup begins again. On successful socket
>+   *    setup, the socket starts receiving messages. The alarm now goes to (2)
>+   *    where it monitors the WebSocket by sending a ping.  Since incoming data
>+   *    is a sign of the connection being up, the ping alarm is reset every
>+   *    time data is received.

This is a grammar nazi nit, but (1) (2) and (3) should all be parallel, because
they're a list.  "Pong timeout" is a different sort of thing than "Send a ping"
and "Reconnect after backoff".

>+      // Websocket is shutdown. Backoff interval expired, try to connect.

s/shutdown/shut down/  ("shutdown" is a noun, and "shut down" is a verb and
adjective.  This distinction is being lost in English, but I will not go down
without a fight!)
Attachment #745092 - Flags: review?(justin.lebar+bug) → review+
https://hg.mozilla.org/mozilla-central/rev/d4f14f6dd401
Status: NEW → RESOLVED
Closed: 7 years ago
Resolution: --- → FIXED
So this morning we hit a problem where a merge between two changesets that were green produced a changeset where most debug tests leaked.  In particular, this merge changeset:
https://hg.mozilla.org/integration/mozilla-inbound/rev/aa2ce091b9fe

I bisected one side of that merge (merging against the tip of the other side), and the guilty changeset ended up being:
https://hg.mozilla.org/mozilla-central/rev/d4f14f6dd401

Thus I backed it out in:
https://hg.mozilla.org/integration/mozilla-inbound/rev/9ec0ad6f7e09

(I don't know what the guilty changeset in the other half of the merge is.)
Status: RESOLVED → REOPENED
Resolution: FIXED → ---
(And I merged the backout to Birch)
Whiteboard: [fixed-in-birch]
Some hg commands that may be helpful (also see "hg help revsets"):

$ hg log -raa2ce091b9fe
changeset:   131642:aa2ce091b9fe
parent:      131641:390d49fe8290
parent:      131596:af2811479de4
user:        Ryan VanderMeulen <ryanvm@gmail.com>
date:        Sat May 11 09:49:12 2013 -0400
summary:     Merge m-c to inbound on a CLOSED TREE.

$ hg log -r 'ancestor(390d49fe8290,af2811479de4)'
changeset:   131543:1d7c615b4b62
parent:      131480:2673016e7df4
parent:      131542:844adf9c7f2e
user:        Ryan VanderMeulen <ryanvm@gmail.com>
date:        Fri May 10 14:56:24 2013 -0400
summary:     Merge birch to m-c. IGNORE BAD COMMIT MESSAGES

$ hg log -r 'ancestors(390d49fe8290) - ancestors(1d7c615b4b62)' --template="{node}\n" | wc -l
62
$ hg log -r 'ancestors(af2811479de4) - ancestors(1d7c615b4b62)' --template="{node}\n" | wc -l
21


I did the bisection of the second set of 21 changesets, by merging different changesets in that set with 390d49fe8290.

The other bisection problem, which may or may not be useful to debug the underlying problem that led to the backout, is to bisect that other set of 62 changesets.
It is very likely that Bug 863599 combined with this to cause the leaks. AlarmService leaks in the current m-c, which doesn't show up on desktop because AlarmService isn't used. But this bug uses it in PushService which is loaded in deskto pby 863599. There is currently a patch for AlarmService leaks in Bug 867868. If that fixes all the leaks I'll file a new bug and land the fix, after which this bug can be landed again. Bug 863599 can wait a bit.

863599 has been backed out for other reasons.
https://hg.mozilla.org/mozilla-central/rev/2bfd3969400b
Status: REOPENED → RESOLVED
Closed: 7 years ago7 years ago
Resolution: --- → FIXED
Should we uplift this to b2g-18?

We are having problems with websockets keep disconnecting while testing the Push Server (traces on wireshark)
blocking-b2g: --- → tef?
Flags: needinfo?(dcoloma)
Does this fix the problem though?
Flags: needinfo?(dcoloma) → needinfo?(willyaranda)
blocking-b2g: tef? → leo?
(b2g-18 is v1.1 is leo?)

(In reply to Guillermo López (:willyaranda) from comment #31)
> Should we uplift this to b2g-18?
> 
> We are having problems with websockets keep disconnecting while testing the
> Push Server (traces on wireshark)

This is not tef, it should not land on 1.0.1, only on 1.1. At most it would be leo+, but I don't think it's exactly blocking either. So we should just ask for approval-mozilla-b2g18 on the patch. 

The rationale for asking for approval/uplift is that push as it's currently on 1.1 it's kinda useless (since once the screen shuts down it stops sending keepalives and thus you lose the 'push' service).

Can you do that, Nikhil?
blocking-b2g: leo? → ---
Comment on attachment 745092 [details] [diff] [review]
Use RTC wakeup to monitor WebSocket connection.


[Approval Request Comment]
Bug caused by (feature/regressing bug #): 822712
User impact if declined: Push won't work when the screen is switched off and phone is idling.
Testing completed: 
Risk to taking this patch (and alternatives if risky): 
String or UUID changes made by this patch:
None
Attachment #745092 - Flags: approval-mozilla-b2g18?
Patch does not apply correctly on b2g18, this is the patch for b2g18. (I need to test it but I don't have a Wireshark until tomorrow morning)
Flags: needinfo?(willyaranda)
Attachment #751492 - Attachment is patch: true
Attachment #751492 - Attachment mime type: text/x-patch → text/plain
(This one the good, I forgot to remove a line)
Attachment #751492 - Attachment is obsolete: true
I figured out how to test it on my home, and remember this:

https://uapush-nv.srv.openwebdevice.com/about

It keeps the "1" opened connection correctly (5 and 10 minutes test, and it does not reproduce bug 869922).

Will update tomorrow with more tests
willyaranda's patch had debug enabled.
Attachment #751494 - Attachment is obsolete: true
Comment on attachment 751713 [details] [diff] [review]
Use RTC wakeup to monitor WebSocket connection - b2g18

Marking correct patch as approval-mozilla-b2g18. Please see comment 35.
Attachment #751713 - Flags: approval-mozilla-b2g18?
This bug is missing a risk evaluation.
Oops!

This is a extremely low risk change which affects only Push. It uses an alternative wakeup timer system, there hasn't been any change in the push logic or API. There really aren't any alternatives as bug description states.
Attachment #751713 - Flags: approval-mozilla-b2g18? → approval-mozilla-b2g18+
You need to log in before you can comment on or make changes to this bug.