Closed Bug 711793 - Opened 13 years ago, Closed 12 years ago

Delay websocket reconnection after abnormal termination

Categories

(Core :: Networking: WebSockets, defect)

Type: defect
Priority: Not set
Severity: normal

Tracking

Status: RESOLVED FIXED
Target Milestone: mozilla16

People

(Reporter: jduell.mcbugs, Assigned: jduell.mcbugs)

References

Details

Attachments

(6 files)

Section 7.2.3 of the RFC says we should add delays if websocket clients try to reconnect after being abnormally terminated.
The rest of the fix requires an additional state (delayed) in addition to connecting/queued. Since the states are mutually exclusive, it's easier at this point to use an enum.   Note that CONNECTING_DELAYED is not actually used yet.

No real code changes here, other than some prep work for states where delayed will also be a possible value, not just connecting.
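For orientation, here's a minimal sketch of the kind of enum this refers to. Only CONNECTING_DELAYED is a name confirmed in this bug; the other identifiers are illustrative placeholders, not necessarily what WebSocketChannel.cpp uses:

  // Hypothetical sketch of the connecting-state enum; only CONNECTING_DELAYED
  // is named in this bug, the other labels are placeholders.
  enum wsConnectingState {
    NOT_CONNECTING = 0,     // no admission attempt in flight
    CONNECTING_QUEUED,      // waiting behind another connection to the same host
    CONNECTING_DELAYED,     // waiting out a reconnect delay (not used yet)
    CONNECTING_IN_PROGRESS  // actively opening the transport
  };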
Assignee: nobody → jduell.mcbugs
Attachment #637034 - Flags: review?(mcmanus)
This just moves nsWSAdmissionManager higher up in the file, since the main patch needs to refer to it in CallOnStop.

No code changes; this just makes the diff of the main patch easier to read/review (aww, how nice of me! :)
Attachment #637035 - Flags: review?(mcmanus)
The meat of it.

I may add a few more things before this lands, but they don't need to stop review:

1) make the initial/max reconnect delay values configurable, and also provide a pref to completely turn off reconnect delays (note: do you think it's worth it?  It won't stop us from hitting new code--there are too many changes here);

2) add logic that automatically cancels all delays and connects immediately if we're notified that the network is back online ('link-status-changed' sounds like the right frob, from talking with Honza): since delays are in the RFC just to prevent DOS-like hordes of reconnects when servers are temporarily down/offline, I don't think we need to do them when we determine it's the client that's having the availability issue.  

I may just wind up doing one or both of these in followup bugs (especially #2), since the code so far is already pretty big and self-contained. Either way, I think this is ready to review as-is. (A rough sketch of the pref wiring for #1 follows below.)
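For item 1, here's roughly what the pref wiring might look like, assuming the usual Preferences getters; the pref names and member variables are illustrative guesses, not taken from the patch:

  #include "mozilla/Preferences.h"

  // Hypothetical pref names and members -- illustrative only.
  static const char kPrefDelaysEnabled[] = "network.websocket.delay-failed-reconnects";
  static const char kPrefMaxDelayMs[]    = "network.websocket.reconnect.max-delay-ms";

  void FailDelayManager::LoadPrefs()
  {
    // Kill switch: when false, never delay reconnects at all.
    mDelaysEnabled = mozilla::Preferences::GetBool(kPrefDelaysEnabled, true);

    // Tunable cap, falling back to the hard-coded constant.
    PRInt32 maxDelay = mozilla::Preferences::GetInt(kPrefMaxDelayMs, kWSReconnectMaxDelay);
    mMaxDelay = maxDelay > 0 ? (PRUint32)maxDelay : kWSReconnectMaxDelay;
  }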
Attachment #637039 - Flags: review?(mcmanus)
Note on testing: this is a hard patch to write an automated test for, since it would require us to make pywebsocket available, then unavailable, then available again, etc.   I think it's easier to just test extensively by hand.

So far I've tested that:

  - reconnect delays occur
  - reconnect delays increase correctly to maximum value
  - FailDelay entries expire if no activity for designated time
  - FailDelay entries are deleted upon successful reconnection
  - websockets to other hosts are not delayed while delay is enforced
  - queued connections wait while delay is enforced.
  - queued connections to same host all connect immediately (serially,
    of course--but no delays) as soon as a delayed head-of-line
    connection to same host connects OK
  - opening/delayed/queued connections that are closed by tab
    close/navigation close correctly and do not count as failures for
    reconnect delay purposes
  - URIs without an explicit port number map to the same FailDelay entry
    as URIs with default port explicitly shown

I designed the logic very carefully, and it's required very little debugging, so I'm feeling pretty good about the risk level for an m-c landing.  Any thoughts on more cases to test (I need to set up an HSTS redirect, for one) are welcome.
Oh, and this code makes test_websocket.html take 10 more seconds to run (since we fail lots of connections in it), so I should figure something out there (I may either disable the pref, or split some of the failure tests into their own files).
So I'm still reading the code in detail, but the whole approach is vastly punitive for a non-existent problem. We want to be careful we don't build a feature that works as designed but slows stuff down!

As presented it's an average 2500 ms delay that doubles on each failure. Ouch!

RFC 6455 does not require any of this - the whole section is a SHOULD. So we need to be careful to balance the user experience here against the need to not pound servers needlessly. Beyond that, the ~2500 ms figure and the doubling are so weakly worded in 6455 as to be nothing more than examples. Both parameters are explicitly left to clients.

What's important here is that there is a backoff. In a severe problem (what we're really worried about preventing) it will back off to high levels quite quickly.

I would suggest something like an initial random seed between 200 and 400 ms, with each failure adding 50% to the delay. That will still provide a radically different level of protection for the server compared to an "as fast as you can" policy.

I feel especially strongly that the initial number ought to be small; the exponential factor is less important, but a single disconnect (perhaps due to an isolated server crash) shouldn't add a large amount of latency to getting going again.
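For concreteness, the suggested policy (random 200-400 ms seed, +50% per failure, capped at the existing 60 s maximum) works out to roughly this; the function name is illustrative, only kWSReconnectMaxDelay comes from the patch:

  #include <algorithm>
  #include <cstdlib>

  // Sketch of the suggested backoff: random 200-400 ms seed, then +50%
  // per consecutive failure, clamped to the existing 60 s maximum.
  PRUint32 NextReconnectDelayMs(PRUint32 aPreviousDelayMs)
  {
    if (aPreviousDelayMs == 0) {
      // First failure for this {host, port}: random seed in [200, 400) ms.
      return 200 + (std::rand() % 200);
    }
    // Subsequent failures: grow by 50%, never exceeding kWSReconnectMaxDelay.
    PRUint32 next = aPreviousDelayMs + aPreviousDelayMs / 2;
    return std::min(next, kWSReconnectMaxDelay);
  }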
Comment on attachment 637034 [details] [diff] [review]
part 1: convert mOpenRunning/Blocked into enum

Review of attachment 637034 [details] [diff] [review]:
-----------------------------------------------------------------

This is an improvement!
Attachment #637034 - Flags: review?(mcmanus) → review+
Comment on attachment 637035 [details] [diff] [review]
part 2: just move nsWSAdmissionManager code

Review of attachment 637035 [details] [diff] [review]:
-----------------------------------------------------------------

that was polite!
Attachment #637035 - Flags: review?(mcmanus) → review+
Comment on attachment 637039 [details] [diff] [review]
part 3: main patch

Review of attachment 637039 [details] [diff] [review]:
-----------------------------------------------------------------

I think the code logic is solid here.. I'm not going to r+ it at this iteration because I at least want to air the logic behind the backoff strategy as noted in comment 6 and this whole thing is risky enough that it needs a giant pre-off kill switch.

::: netwerk/protocol/websocket/WebSocketChannel.cpp
@@ +90,5 @@
> +// Stores entries (searchable by {host, port}) of connections that have recently
> +// failed, so we can do delay of reconnects per RFC 6455 Section 7.2.3
> +//-----------------------------------------------------------------------------
> +
> +// Constants (TODO: add about:config prefs to control each of these)

minimally I think you need a bool pref now to disable the feature entirely. It will be invaluable if we get bug reports and want the reporter to test with the pref off. Also to avoid code backout later on.
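Something like the following shape is presumably what's being asked for; the pref/member names here are just illustrative:

  // Hypothetical use of the kill-switch pref at the point where a reconnect
  // would otherwise be delayed; names are illustrative, not from the patch.
  if (!mDelaysEnabled) {
    ws->BeginOpen();   // feature pref'd off: skip FailDelay bookkeeping entirely
    return;
  }
  // ...otherwise look up {host, port} and schedule the delayed reconnect...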

@@ +97,5 @@
> +PRUint32 kWSReconnectMaxFirstDelay  = 5  * 1000;
> +// Base lifetime (in ms) of a FailDelay: kept longer if more failures occur
> +PRUint32 kWSReconnectBaseLifeTime   = 60 * 1000;
> +// Maximum reconnect delay (in ms)
> +PRUint32 kWSReconnectMaxDelay       = 60 * 1000;

declare all these k* vars as const
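i.e. same values, just const-qualified:

  const PRUint32 kWSReconnectMaxFirstDelay  = 5  * 1000;
  const PRUint32 kWSReconnectBaseLifeTime   = 60 * 1000;
  const PRUint32 kWSReconnectMaxDelay       = 60 * 1000;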

@@ +99,5 @@
> +PRUint32 kWSReconnectBaseLifeTime   = 60 * 1000;
> +// Maximum reconnect delay (in ms)
> +PRUint32 kWSReconnectMaxDelay       = 60 * 1000;
> +
> +struct FailDelay {

later on you say "friend class FailDelay" .. let's call it a class. Personally, I tend to think that if it has methods, it's a class :)

@@ +120,5 @@
> +  bool IsExpired(TimeStamp rightNow)
> +  {
> +    TimeDuration dur = rightNow - mLastFailure;
> +    PRUint32 sinceFail = (PRUint32) dur.ToMilliseconds();
> +    return sinceFail > (kWSReconnectBaseLifeTime + mNextDelay);

converting into int32 is always a sign I'm misusing a timestamp class :).. how do you feel about:

bool IsExpired(TimeStamp rightNow)
{
  return (mLastFailure + TimeDuration::FromMilliseconds(kWSReconnectBaseLifeTime + mNextDelay)) <= rightNow;
}

it's not important to me.

@@ +213,5 @@
> +
> +    TimeStamp rightNow = TimeStamp::Now();
> +
> +    // iterate from end, to make deletion indexing easier
> +    for (PRInt32 i = mEntries.Length() -1; i >= 0; i--) {

pedantic.. "-1" to "- 1" and i-- to --i
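With both nits applied (and guessing at the surrounding method body, which isn't quoted here), the sweep would read roughly:

  // Expiry sweep with the style nits applied; mEntries is assumed to be an
  // nsTArray<FailDelay*>, and iterating from the end keeps indices valid
  // while elements are removed.
  TimeStamp rightNow = TimeStamp::Now();
  for (PRInt32 i = mEntries.Length() - 1; i >= 0; --i) {
    if (mEntries[i]->IsExpired(rightNow)) {
      delete mEntries[i];
      mEntries.RemoveElementAt(i);
    }
  }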
Attachment #637039 - Flags: review?(mcmanus) → review-
(In reply to Patrick McManus [:mcmanus] from comment #9)
> 
> needs a giant pre-off kill switch.
> 

I mean pref-off kill switch, of course.
this is a patch on top of part 3.
Attachment #637655 - Flags: review?(mcmanus)
Attachment #637656 - Flags: review?(mcmanus)
Attachment #637655 - Flags: review?(mcmanus) → review+
Attachment #637656 - Flags: review?(mcmanus) → review+
Comment on attachment 637039 [details] [diff] [review]
part 3: main patch

r=mcmanus with the 2 incremental patches.. if it were me I'd check it in with them all rolled into a new "part 3"
Attachment #637039 - Flags: review- → review+
fixed style nits, retested manual tests, landed on inbound:

https://hg.mozilla.org/integration/mozilla-inbound/rev/25663e7d7f56
Trivial enough: the FailDelayManager wasn't deleting any remaining FailDelay records at shutdown.

Note: I took a look at the records that were still present, and they were all bogus hosts ("this.websocket.server.probably.does.not.exist"), so it makes sense that the 60 second timeout for their FailDelay records hadn't expired and they hadn't been purged yet.

Now I'm waiting for a workaround to bug 766810 so I can actually land again.   Will just land on m-c tomorrow if a fix doesn't materialize before then.
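A sketch of what the shutdown fix amounts to (mEntries and its element type are assumed from the hunks quoted above, not copied from the actual interdiff):

  // Illustrative version of the leak fix: free any FailDelay records still
  // alive when the manager itself is torn down at shutdown.
  FailDelayManager::~FailDelayManager()
  {
    for (PRUint32 i = 0; i < mEntries.Length(); ++i) {
      delete mEntries[i];
    }
    mEntries.Clear();
  }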
Glad the leak was easy to fix. Can you post an updated patch with the fixes since your last patch here, including the memory leak?
https://hg.mozilla.org/mozilla-central/rev/219499cc5eff

I'll post the interdiff of the nits (I got a verbal r+ on them from Patrick) and the memory leak fix.
Status: NEW → RESOLVED
Closed: 12 years ago
Resolution: --- → FIXED
Target Milestone: --- → mozilla16
The memory leak fix is in ~FailDelayManager()
Blocks: 771318
No longer blocks: 771318
Depends on: 771318
Depends on: 1017645
Depends on: 1186160
No longer regressions: 1781065