Closed Bug 711793 - Opened 13 years ago, Closed 12 years ago

Delay websocket reconnection after abnormal termination

Categories

(Core :: Networking: WebSockets, defect)

Type: defect
Priority: Not set
Severity: normal

Tracking

Status: RESOLVED FIXED
Target Milestone: mozilla16

People

(Reporter: jduell.mcbugs, Assigned: jduell.mcbugs)

References

Details

Attachments

(6 files)

Section 7.2.3 of the RFC says we should add delays if websocket clients try to reconnect after being abnormally terminated.
The rest of the fix requires an additional state (delayed) in addition to connecting/queued. Since the states are mutually exclusive, it's easier at this point to use an enum.   Note that CONNECTING_DELAYED is not actually used yet.

No real code changes here, other than some prep work for states where delayed will also be a possible value, not just connecting.
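For orientation, here's a minimal sketch of the kind of enum this refers to. Only CONNECTING_DELAYED is a name confirmed in this bug; the other identifiers are illustrative placeholders, not necessarily what WebSocketChannel.cpp uses:

  // Hypothetical sketch of the connecting-state enum; only CONNECTING_DELAYED
  // is named in this bug, the other labels are placeholders.
  enum wsConnectingState {
    NOT_CONNECTING = 0,     // no admission attempt in flight
    CONNECTING_QUEUED,      // waiting behind another connection to the same host
    CONNECTING_DELAYED,     // waiting out a reconnect delay (not used yet)
    CONNECTING_IN_PROGRESS  // actively opening the transport
  };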
Assignee: nobody → jduell.mcbugs
Attachment #637034 - Flags: review?(mcmanus)
This just moves nsWSAdmissionManager higher up in the file, since the main patch needs to refer to it in CallOnStop.

No code changes; this just makes the diff of the main patch easier to read/review (aww, how nice of me! :)
Attachment #637035 - Flags: review?(mcmanus)
The meat of it.

I may add a few more things before this lands, but they don't need to stop review:

1) make the initial/max reconnect delay values configurable, and also provide a pref to completely turn off reconnect delays (note: do you think it's worth it?  It won't stop us from hitting new code--there are too many changes here);

2) add logic that automatically cancels all delays and connects immediately if we're notified that the network is back online ('link-status-changed' sounds like the right frob, from talking with Honza): since delays are in the RFC just to prevent DOS-like hordes of reconnects when servers are temporarily down/offline, I don't think we need to do them when we determine it's the client that's having the availability issue.  

I may just wind up doing one or both of these in followup bugs (especially #2), since the code so far is already pretty big and self-contained. Either way, I think this is ready to review as-is. (A rough sketch of the pref wiring for #1 follows below.)
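For item 1, here's roughly what the pref wiring might look like, assuming the usual Preferences getters; the pref names and member variables are illustrative guesses, not taken from the patch:

  #include "mozilla/Preferences.h"

  // Hypothetical pref names and members -- illustrative only.
  static const char kPrefDelaysEnabled[] = "network.websocket.delay-failed-reconnects";
  static const char kPrefMaxDelayMs[]    = "network.websocket.reconnect.max-delay-ms";

  void FailDelayManager::LoadPrefs()
  {
    // Kill switch: when false, never delay reconnects at all.
    mDelaysEnabled = mozilla::Preferences::GetBool(kPrefDelaysEnabled, true);

    // Tunable cap, falling back to the hard-coded constant.
    PRInt32 maxDelay = mozilla::Preferences::GetInt(kPrefMaxDelayMs, kWSReconnectMaxDelay);
    mMaxDelay = maxDelay > 0 ? (PRUint32)maxDelay : kWSReconnectMaxDelay;
  }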
Attachment #637039 - Flags: review?(mcmanus)
Note on testing: this is a hard patch to write an automated test for, since it would require us to make pywebsocket available, then unavailable, then available again, etc.   I think it's easier to just test extensively by hand.

So far I've tested that:

  - reconnect delays occur
  - reconnect delays increase correctly to maximum value
  - FailDelay entries expire if no activity for designated time
  - FailDelay entries are deleted upon successful reconnection
  - websockets to other hosts are not delayed while delay is enforced
  - queued connections wait while delay is enforced.
  - queued connections to same host all connect immediately (serially,
    of course--but no delays) as soon as a delayed head-of-line
    connection to same host connects OK
  - opening/delayed/queued connections that are closed by tab
    close/navigation close correctly and do not count as failures for
    reconnect delay purposes
  - URIs without an explicit port number map to the same FailDelay entry
    as URIs with default port explicitly shown

I designed the logic very carefully, and it's required very little debugging, so I'm feeling pretty good about the risk level for an m-c landing.  Any thoughts on more cases to test (I need to set up an HSTS redirect, for one) are welcome.
Oh, and this code makes test_websocket.html take 10 more seconds to run (since we fail lots of connections in it), so I should figure something out there (I may either disable the pref, or split some of the failure tests into their own files).
So I'm still reading the code in detail, but the whole approach is vastly punitive for a non-existent problem. We want to be careful we don't build a feature that works as designed but slows stuff down!

As presented it's an average 2500 ms delay that doubles on each failure. Ouch!

RFC 6455 does not require any of this - the whole section is a SHOULD. So we need to be careful to balance the user experience here against the need to not pound servers needlessly. Beyond that, the ~2500 ms figure and the doubling are so weakly worded in 6455 as to be nothing more than examples. Both parameters are explicitly left to clients.

What's important here is that there is a backoff. In a severe problem (what we're really worried about preventing) it will back off to high levels quite quickly.

I would suggest something like an initial random seed between 200 and 400 ms, with each failure adding 50% to the delay. That will still provide a radically different level of protection for the server compared to an "as fast as you can" policy.

I feel especially strongly that the initial number ought to be small; the exponential factor is less important, but a single disconnect (perhaps due to an isolated server crash) shouldn't add a large amount of latency to getting going again.
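For concreteness, the suggested policy (random 200-400 ms seed, +50% per failure, capped at the existing 60 s maximum) works out to roughly this; the function name is illustrative, only kWSReconnectMaxDelay comes from the patch:

  #include <algorithm>
  #include <cstdlib>

  // Sketch of the suggested backoff: random 200-400 ms seed, then +50%
  // per consecutive failure, clamped to the existing 60 s maximum.
  PRUint32 NextReconnectDelayMs(PRUint32 aPreviousDelayMs)
  {
    if (aPreviousDelayMs == 0) {
      // First failure for this {host, port}: random seed in [200, 400) ms.
      return 200 + (std::rand() % 200);
    }
    // Subsequent failures: grow by 50%, never exceeding kWSReconnectMaxDelay.
    PRUint32 next = aPreviousDelayMs + aPreviousDelayMs / 2;
    return std::min(next, kWSReconnectMaxDelay);
  }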
Comment on attachment 637034 [details] [diff] [review]
part 1: convert mOpenRunning/Blocked into enum

Review of attachment 637034 [details] [diff] [review]:
-----------------------------------------------------------------

This is an improvement!
Attachment #637034 - Flags: review?(mcmanus) → review+
Comment on attachment 637035 [details] [diff] [review]
part 2: just move nsWSAdmissionManager code

Review of attachment 637035 [details] [diff] [review]:
-----------------------------------------------------------------

that was polite!
Attachment #637035 - Flags: review?(mcmanus) → review+
Comment on attachment 637039 [details] [diff] [review]
part 3: main patch

Review of attachment 637039 [details] [diff] [review]:
-----------------------------------------------------------------

I think the code logic is solid here.. I'm not going to r+ it at this iteration because I at least want to air the logic behind the backoff strategy as noted in comment 6 and this whole thing is risky enough that it needs a giant pre-off kill switch.

::: netwerk/protocol/websocket/WebSocketChannel.cpp
@@ +90,5 @@
> +// Stores entries (searchable by {host, port}) of connections that have recently
> +// failed, so we can do delay of reconnects per RFC 6455 Section 7.2.3
> +//-----------------------------------------------------------------------------
> +
> +// Constants (TODO: add about:config prefs to control each of these)

minimally I think you need a bool pref now to disable the feature entirely. It will be invaluable if we get bug reports and want the reporter to test with the pref off. Also to avoid code backout later on.
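Something like the following shape is presumably what's being asked for; the pref/member names here are just illustrative:

  // Hypothetical use of the kill-switch pref at the point where a reconnect
  // would otherwise be delayed; names are illustrative, not from the patch.
  if (!mDelaysEnabled) {
    ws->BeginOpen();   // feature pref'd off: skip FailDelay bookkeeping entirely
    return;
  }
  // ...otherwise look up {host, port} and schedule the delayed reconnect...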

@@ +97,5 @@
> +PRUint32 kWSReconnectMaxFirstDelay  = 5  * 1000;
> +// Base lifetime (in ms) of a FailDelay: kept longer if more failures occur
> +PRUint32 kWSReconnectBaseLifeTime   = 60 * 1000;
> +// Maximum reconnect delay (in ms)
> +PRUint32 kWSReconnectMaxDelay       = 60 * 1000;

declare all these k* vars as const
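i.e. same values, just const-qualified:

  const PRUint32 kWSReconnectMaxFirstDelay  = 5  * 1000;
  const PRUint32 kWSReconnectBaseLifeTime   = 60 * 1000;
  const PRUint32 kWSReconnectMaxDelay       = 60 * 1000;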

@@ +99,5 @@
> +PRUint32 kWSReconnectBaseLifeTime   = 60 * 1000;
> +// Maximum reconnect delay (in ms)
> +PRUint32 kWSReconnectMaxDelay       = 60 * 1000;
> +
> +struct FailDelay {

later on you say "friend class FailDelay" .. let's call it a class. Personally, I tend to think that if it has methods, it's a class :)

@@ +120,5 @@
> +  bool IsExpired(TimeStamp rightNow)
> +  {
> +    TimeDuration dur = rightNow - mLastFailure;
> +    PRUint32 sinceFail = (PRUint32) dur.ToMilliseconds();
> +    return sinceFail > (kWSReconnectBaseLifeTime + mNextDelay);

converting into int32 is always a sign I'm misusing a timestamp class :).. how do you feel about:

bool IsExpired(TimeStamp rightNow)
{
  return (mLastFailure + TimeDuration::FromMilliseconds(kWSReconnectBaseLifeTime + mNextDelay)) <= rightNow;
}

it's not important to me.

@@ +213,5 @@
> +
> +    TimeStamp rightNow = TimeStamp::Now();
> +
> +    // iterate from end, to make deletion indexing easier
> +    for (PRInt32 i = mEntries.Length() -1; i >= 0; i--) {

pedantic.. "-1" to "- 1" and i-- to --i
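With both nits applied (and guessing at the surrounding method body, which isn't quoted here), the sweep would read roughly:

  // Expiry sweep with the style nits applied; mEntries is assumed to be an
  // nsTArray<FailDelay*>, and iterating from the end keeps indices valid
  // while elements are removed.
  TimeStamp rightNow = TimeStamp::Now();
  for (PRInt32 i = mEntries.Length() - 1; i >= 0; --i) {
    if (mEntries[i]->IsExpired(rightNow)) {
      delete mEntries[i];
      mEntries.RemoveElementAt(i);
    }
  }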
Attachment #637039 - Flags: review?(mcmanus) → review-
(In reply to Patrick McManus [:mcmanus] from comment #9)
> 
> needs a giant pre-off kill switch.
> 

I mean pref-off kill switch, of course.
this is a patch on top of part 3.
Attachment #637655 - Flags: review?(mcmanus)
Attachment #637656 - Flags: review?(mcmanus)
Attachment #637655 - Flags: review?(mcmanus) → review+
Attachment #637656 - Flags: review?(mcmanus) → review+
Comment on attachment 637039 [details] [diff] [review]
part 3: main patch

r=mcmanus with the 2 incremental patches.. if it were me I'd check it in with them all rolled into a new "part 3"
Attachment #637039 - Flags: review- → review+
fixed style nits, retested manual tests, landed on inbound:

https://hg.mozilla.org/integration/mozilla-inbound/rev/25663e7d7f56
Trivial enough: the FailDelayManager wasn't deleting any remaining FailDelay records at shutdown.

Note: I took a look at the records that were still present, and they were all bogus hosts ("this.websocket.server.probably.does.not.exist"), so it makes sense that the 60 second timeout for their FailDelay records hadn't expired and they hadn't been purged yet.

Now I'm waiting for a workaround to bug 766810 so I can actually land again.   Will just land on m-c tomorrow if a fix doesn't materialize before then.
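A sketch of what the shutdown fix amounts to (mEntries and its element type are assumed from the hunks quoted above, not copied from the actual interdiff):

  // Illustrative version of the leak fix: free any FailDelay records still
  // alive when the manager itself is torn down at shutdown.
  FailDelayManager::~FailDelayManager()
  {
    for (PRUint32 i = 0; i < mEntries.Length(); ++i) {
      delete mEntries[i];
    }
    mEntries.Clear();
  }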
Glad the leak was easy to fix. Can you post an updated patch with the fixes since your last patch here, including the memory leak?
https://hg.mozilla.org/mozilla-central/rev/219499cc5eff

I'll post the interdiff of the nits (I got a verbal r+ on them from Patrick) and the memory leak fix.
Status: NEW → RESOLVED
Closed: 12 years ago
Resolution: --- → FIXED
Target Milestone: --- → mozilla16
The memory leak fix is in ~FailDelayManager()
Blocks: 771318
No longer blocks: 771318
Depends on: 771318
Depends on: 1017645
Depends on: 1186160
No longer regressions: 1781065