Bug 711793 - Delay websocket reconnection after abnormal termination

Status: RESOLVED FIXED
Product: Core
Classification: Components
Component: Networking: WebSockets
Version: unspecified
Hardware/OS: All / All
Importance: -- normal
Target Milestone: mozilla16
Assigned To: Jason Duell [:jduell] (needinfo me)
Depends on: 1017645 771318 1186160

Reported: 2011-12-17 21:12 PST by Jason Duell [:jduell] (needinfo me)
Modified: 2015-12-01 17:33 PST
CC: 6 users


Attachments

part 1: convert mOpenRunning/Blocked into enum (8.97 KB, patch)
2012-06-27 01:46 PDT, Jason Duell [:jduell] (needinfo me), mcmanus: review+

part 2: just move nsWSAdmissionManager code (11.82 KB, patch)
2012-06-27 01:48 PDT, Jason Duell [:jduell] (needinfo me), mcmanus: review+

part 3: main patch (30.05 KB, patch)
2012-06-27 01:59 PDT, Jason Duell [:jduell] (needinfo me), mcmanus: review+

1st fix to patch 3: lower & slower delays (3.26 KB, patch)
2012-06-28 13:46 PDT, Jason Duell [:jduell] (needinfo me), mcmanus: review+

2nd fix to patch 3: pref to disable delays (6.39 KB, patch)
2012-06-28 13:47 PDT, Jason Duell [:jduell] (needinfo me), mcmanus: review+

nits and memory leak fix to part 3 (7.41 KB, patch)
2012-06-28 19:35 PDT, Jason Duell [:jduell] (needinfo me), no flags

Description Jason Duell [:jduell] (needinfo me) 2011-12-17 21:12:53 PST
Section 7.2.3 of RFC 6455 says we should add delays when websocket clients try to reconnect after being abnormally terminated.
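For context, the RFC's recommendation boils down to a random first delay followed by exponential backoff. A minimal sketch of that policy (illustrative only, not the patch itself; the constant names mirror those in the patch under review below, and the function name is made up):

```cpp
#include <algorithm>
#include <cassert>
#include <cstdint>
#include <cstdlib>

// Cap on the randomized first delay and on the backed-off delay (ms).
const uint32_t kWSReconnectMaxFirstDelay = 5 * 1000;
const uint32_t kWSReconnectMaxDelay      = 60 * 1000;

// Returns the delay (ms) before the next reconnect attempt.
// prevDelay == 0 means this is the first failure for this {host, port}.
uint32_t NextReconnectDelay(uint32_t prevDelay)
{
  if (prevDelay == 0) {
    // First failure: random delay in [0, kWSReconnectMaxFirstDelay).
    return static_cast<uint32_t>(rand()) % kWSReconnectMaxFirstDelay;
  }
  // Subsequent failures: double the delay, capped at the maximum.
  return std::min(prevDelay * 2, kWSReconnectMaxDelay);
}
```

The randomized first delay keeps a fleet of clients from reconnecting in lockstep when a server restarts; the doubling bounds the aggregate reconnect rate during a long outage.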
Comment 1 Jason Duell [:jduell] (needinfo me) 2012-06-27 01:46:01 PDT
Created attachment 637034 [details] [diff] [review]
part 1: convert mOpenRunning/Blocked into enum

The rest of the fix requires an additional state (delayed) in addition to connecting/queued, and the states are mutually exclusive, so it's easier at this point to use an enum. Note that CONNECTING_DELAYED is not actually used yet.

No real code changes here, other than some prep work for states where 'delayed' will also be a possibly valid value, not just 'connecting'.
Comment 2 Jason Duell [:jduell] (needinfo me) 2012-06-27 01:48:37 PDT
Created attachment 637035 [details] [diff] [review]
part 2: just move nsWSAdmissionManager code

This just moves nsWSAdmissionManager higher up in the file, since the main patch needs to refer to it in CallOnStop.

No code changes, just makes diff of main patch easier to read/review (aww, how nice of me! :)
Comment 3 Jason Duell [:jduell] (needinfo me) 2012-06-27 01:59:10 PDT
Created attachment 637039 [details] [diff] [review]
part 3: main patch

The meat of it.

I may add a few more things before this lands, but they don't need to stop review:

1) make the initial/max reconnect delay values configurable, and also provide a pref to completely turn off reconnect delays (note: do you think it's worth it?  It won't stop us from hitting new code--there are too many changes here);

2) add logic that automatically cancels all delays and connects immediately if we're notified that the network is back online ('link-status-changed' sounds like the right frob, from talking with Honza): since delays are in the RFC just to prevent DOS-like hordes of reconnects when servers are temporarily down/offline, I don't think we need to do them when we determine it's the client that's having the availability issue.  

I may wind up doing one or both of these in followup bugs (esp. #2), since the code so far is already pretty big and self-contained. Either way, the patches are ready to review as-is, I think.
Comment 4 Jason Duell [:jduell] (needinfo me) 2012-06-27 02:08:09 PDT
Note on testing:  this is a hard patch to write an automated test for: it would require us to make pywebsocket available, then not, then available again, etc.   I think it's easier to just extensively test by hand.

So far I've tested that:

  - reconnect delays occur
  - reconnect delays increase correctly to maximum value
  - FailDelay entries expire if no activity for designated time
  - FailDelay entries are deleted upon successful reconnection
  - websockets to other hosts are not delayed while delay is enforced
  - queued connections wait while delay is enforced.
  - queued connections to same host all connect immediately (serially,
    of course--but no delays) as soon as a delayed head-of-line
    connection to same host connects OK
  - opening/delayed/queued connections that are closed by tab
    close/navigation close correctly and do not count as failures for
    reconnect delay purposes
  - URIs without an explicit port number map to the same FailDelay entry
    as URIs with default port explicitly shown

I designed the logic very carefully, and it's required very little debugging, so I'm feeling pretty good about the risk level for an m-c landing.  Any thoughts on more cases to test (I need to set up an HSTS redirect, for one) are welcome.
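The last test case above (URIs with and without an explicit default port mapping to one FailDelay entry) implies the lookup key must normalize the port. A hypothetical sketch of that normalization (the function name and signature are illustrative, not from the patch):

```cpp
#include <cassert>
#include <cstdint>
#include <string>

// Build the FailDelay lookup key for a websocket URI. Using -1 for "no
// explicit port" and filling in the scheme default makes
// "ws://example.com" and "ws://example.com:80" share one entry.
std::string FailDelayKey(const std::string& host, int32_t port, bool secure)
{
  if (port == -1) {              // URI did not specify a port
    port = secure ? 443 : 80;    // wss:// defaults to 443, ws:// to 80
  }
  return host + ":" + std::to_string(port);
}
```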
Comment 5 Jason Duell [:jduell] (needinfo me) 2012-06-27 02:11:35 PDT
Oh, and this code makes test_websocket.html take 10 more seconds to run (since we fail lots of connections in it), so I should figure something out there (I may either disable the pref, or split some of the failure tests into their own files).
Comment 6 Patrick McManus [:mcmanus] 2012-06-28 09:28:51 PDT
so I'm still reading the code in detail, but the whole nature of this is vastly punitive for a non-existent problem. We want to be careful we don't build a feature that works as designed and yet slows stuff down!

As presented it's an avg 2500ms delay that doubles on each failure. Ouch!

RFC 6455 does not require any of this - the whole section is a SHOULD. So we need to be careful to balance the UX here with the need to not pound servers needlessly. Beyond that, the ~2500ms figure and the doubling are so weakly worded in 6455 as to be nothing more than examples. Both parameters are explicitly left to clients.

What's important here is that there is a backoff. In a severe problem (what we're really worried about preventing) it will backoff to high levels quite quickly.

I would suggest something like an initial random seed between 200 and 400 ms with each failure adding 50% to the delay. That will still provide a radically different level of protection for the server over an "as fast as you can" policy.

I feel especially strongly that the initial number ought to be small.. the exponential factor is less important - but a single disconnect (perhaps due to an isolated server crash) shouldn't add large amounts of latency to getting going again.
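The alternative policy suggested above can be sketched as follows (a sketch of the proposal, not the landed code; the function name and the 60s cap are assumptions):

```cpp
#include <algorithm>
#include <cassert>
#include <cstdint>
#include <cstdlib>

// Assumed ceiling on the backed-off delay (ms).
const uint32_t kMaxDelayMs = 60 * 1000;

// Returns the delay (ms) before the next reconnect attempt.
// prevDelay == 0 means this is the first failure for this host.
uint32_t SuggestedNextDelay(uint32_t prevDelay)
{
  if (prevDelay == 0) {
    // Initial random seed between 200 and 400 ms.
    return 200 + static_cast<uint32_t>(rand()) % 200;
  }
  // Each further failure adds 50% to the delay, capped at the maximum.
  return std::min(prevDelay + prevDelay / 2, kMaxDelayMs);
}
```

Compared with a 2500ms-average doubling policy, a single isolated failure costs well under half a second, while a persistent outage still backs off to multi-second delays within a handful of attempts.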
Comment 7 Patrick McManus [:mcmanus] 2012-06-28 10:06:56 PDT
Comment on attachment 637034 [details] [diff] [review]
part 1: convert mOpenRunning/Blocked into enum

Review of attachment 637034 [details] [diff] [review]:
-----------------------------------------------------------------

This is an improvement!
Comment 8 Patrick McManus [:mcmanus] 2012-06-28 10:07:00 PDT
Comment on attachment 637035 [details] [diff] [review]
part 2: just move nsWSAdmissionManager code

Review of attachment 637035 [details] [diff] [review]:
-----------------------------------------------------------------

that was polite!
Comment 9 Patrick McManus [:mcmanus] 2012-06-28 10:07:24 PDT
Comment on attachment 637039 [details] [diff] [review]
part 3: main patch

Review of attachment 637039 [details] [diff] [review]:
-----------------------------------------------------------------

I think the code logic is solid here.. I'm not going to r+ it at this iteration because I at least want to air the logic behind the backoff strategy as noted in comment 6 and this whole thing is risky enough that it needs a giant pre-off kill switch.

::: netwerk/protocol/websocket/WebSocketChannel.cpp
@@ +90,5 @@
> +// Stores entries (searchable by {host, port}) of connections that have recently
> +// failed, so we can do delay of reconnects per RFC 6455 Section 7.2.3
> +//-----------------------------------------------------------------------------
> +
> +// Constants (TODO: add about:config prefs to control each of these)

minimally I think you need a bool pref now to disable the feature entirely. It will be invaluable if we get bug reports and want the reporter to test with the pref off. Also to avoid code backout later on.

@@ +97,5 @@
> +PRUint32 kWSReconnectMaxFirstDelay  = 5  * 1000;
> +// Base lifetime (in ms) of a FailDelay: kept longer if more failures occur
> +PRUint32 kWSReconnectBaseLifeTime   = 60 * 1000;
> +// Maximum reconnect delay (in ms)
> +PRUint32 kWSReconnectMaxDelay       = 60 * 1000;

declare all these k* vars as const

@@ +99,5 @@
> +PRUint32 kWSReconnectBaseLifeTime   = 60 * 1000;
> +// Maximum reconnect delay (in ms)
> +PRUint32 kWSReconnectMaxDelay       = 60 * 1000;
> +
> +struct FailDelay {

later on you say "friend class FailDelay" .. let's call it a class. Personally, I tend to think if it has methods it's a class :)

@@ +120,5 @@
> +  bool IsExpired(TimeStamp rightNow)
> +  {
> +    TimeDuration dur = rightNow - mLastFailure;
> +    PRUint32 sinceFail = (PRUint32) dur.ToMilliseconds();
> +    return sinceFail > (kWSReconnectBaseLifeTime + mNextDelay);

converting into int32 is always a sign I'm misusing a timestamp class :).. how do you feel about:

bool IsExpired(TimeStamp rightNow)
{
  return (mLastFailure + TimeDuration::FromMilliseconds(kWSReconnectBaseLifeTime + mNextDelay)) <= rightNow;
}

it's not impt to me.

@@ +213,5 @@
> +
> +    TimeStamp rightNow = TimeStamp::Now();
> +
> +    // iterate from end, to make deletion indexing easier
> +    for (PRInt32 i = mEntries.Length() -1; i >= 0; i--) {

pedantic.. "-1" to "- 1" and i-- to --i
Comment 10 Patrick McManus [:mcmanus] 2012-06-28 10:08:15 PDT
(In reply to Patrick McManus [:mcmanus] from comment #9)
> 
> needs a giant pre-off kill switch.
> 

I mean pref-off kill switch, of course.
Comment 11 Jason Duell [:jduell] (needinfo me) 2012-06-28 13:46:14 PDT
Created attachment 637655 [details] [diff] [review]
1st fix to patch 3: lower & slower delays

this is a patch on top of part 3.
Comment 12 Jason Duell [:jduell] (needinfo me) 2012-06-28 13:47:08 PDT
Created attachment 637656 [details] [diff] [review]
2nd fix to patch 3: pref to disable delays
Comment 13 Patrick McManus [:mcmanus] 2012-06-28 14:07:20 PDT
Comment on attachment 637039 [details] [diff] [review]
part 3: main patch

r=mcmanus with the 2 incremental patches.. if it were me I'd check it in with them all rolled into a new "part 3"
Comment 14 Jason Duell [:jduell] (needinfo me) 2012-06-28 14:44:24 PDT
fixed style nits, retested manual tests, landed on inbound:

https://hg.mozilla.org/integration/mozilla-inbound/rev/25663e7d7f56
Comment 15 Jason Duell [:jduell] (needinfo me) 2012-06-28 16:59:56 PDT
backed out due to memory leak:

  https://tbpl.mozilla.org/php/getParsedLog.php?id=13084043&tree=Mozilla-Inbound#error0
Comment 16 Jason Duell [:jduell] (needinfo me) 2012-06-28 18:37:08 PDT
Trivial enough--the FailDelayManager wasn't deleting any remaining FailDelay records at shutdown.

Note: I took a look at the records that were still present, and they were all bogus hosts ("this.websocket.server.probably.does.not.exist"), so it makes sense that the 60 second timeout for their FailDelay records hadn't expired and they hadn't been purged yet.

Now I'm waiting for a workaround to bug 766810 so I can actually land again.   Will just land on m-c tomorrow if a fix doesn't materialize before then.
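The shape of the leak fix described in comment 16 is straightforward to sketch (member names, the `Add` helper, and the instance counter are illustrative assumptions, not the actual patch):

```cpp
#include <cassert>
#include <vector>

// Stand-in for the real FailDelay record; the static counter exists only so
// this sketch can demonstrate that no instances outlive the manager.
struct FailDelay {
  FailDelay() { ++sCount; }
  ~FailDelay() { --sCount; }
  static int sCount;  // number of live instances
};
int FailDelay::sCount = 0;

class FailDelayManager {
public:
  // The leak: without this destructor, FailDelay records still present at
  // shutdown (e.g. entries for unreachable hosts whose 60-second lifetime
  // had not yet expired) were never freed.
  ~FailDelayManager()
  {
    for (FailDelay* entry : mEntries) {
      delete entry;
    }
    mEntries.clear();
  }

  void Add(FailDelay* entry) { mEntries.push_back(entry); }

private:
  std::vector<FailDelay*> mEntries;  // owning raw pointers
};
```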
Comment 17 Josh Aas 2012-06-28 19:06:21 PDT
Glad the leak was easy to fix. Can you post an updated patch with the fixes since your last patch here, including the memory leak?
Comment 18 Jason Duell [:jduell] (needinfo me) 2012-06-28 19:15:33 PDT
https://hg.mozilla.org/mozilla-central/rev/219499cc5eff

I'll post the interdiff of the nits (I got a verbal +r on them from patrick) and the memory leak fix.
Comment 19 Jason Duell [:jduell] (needinfo me) 2012-06-28 19:35:10 PDT
Created attachment 637768 [details] [diff] [review]
nits and memory leak fix to part 3.

The memory leak fix is in ~FailDelayManager()
