Closed Bug 696085 Opened 13 years ago Closed 12 years ago

WebSocket connection opening across page loads

Categories

(Core :: Networking: WebSockets, defect)

defect
Not set
normal

Tracking

()

RESOLVED FIXED
mozilla13
Tracking Status
firefox12 - ---

People

(Reporter: emelia, Assigned: jduell.mcbugs)

Details

(Whiteboard: [MemShrink:P2])

Attachments

(3 files)

What did you do?

1. Visit http://test.pusher.com/
2. Open JavaScript Console
3. Select 1.8.6 or 1.9.3 from the list
4. See that a WebSocket Connection is opened.
5. Navigate away from the page (say by clicking the header back to http://test.pusher.com/ )


What happened?

When navigating, a "ghost" WebSocket connection is opened, you cannot interact with this connection on the new page, and the new page definitely does not establish the connection.


What should have happened?

There should have been no WebSocket connection opened when navigating away from the page.

More Information:

In the pusher javascript library, when we receive onclose on a websocket connection, we call another function, which in turn sets a timeout of 0ms to call a function that creates a new websocket connection.

That said, I've still not been able to create a static HTML test page which doesn't use the Pusher javascript. So, for now, this issue is without a solid test case, the best I can give you are my steps to reproduce above.
Component: General → Networking: WebSockets
Product: Firefox → Core
QA Contact: general → networking.websockets
I'm doubtful that this is a problem with the websocket implementation, but we'll leave it here until some kind of reduction in the testcase occurs to make it more clear as to who's at fault.
Well, it may not be an issue with the websocket implementation, but the fact that code is running across a page load is a bit strange. We're going to try and reduce down a test case.
Okay, we've just managed to reduce down a testcase using a raw websocket object, a simple sinatra file server, and an eventmachine websocket server. 

See: https://gist.github.com/1308932
Oh, some additional information: 

In the sinatra app, we have a 0.1 second delay loading the index (non-websocket) page, this is to simulate network latency, as we did not see this bug without that delay.
I talked to Micheil at a conference the other day, he's now able to reproduce on-demand with the testcase referenced in comment #3.

Can someone familiar with the websockets implementation code take a look at the testcase, or suggest a better way to reduce this scenario so that the bug is clearer?
I'll look at this soon.  Thanks a lot for the testcase--those are invaluable.
biesi suggests we make sure the WebsocketChannel is in the correct load group--we should be receiving a cancel when the page navigation happens.  I'll see if that's the case when I get a chance to look at this--if anyone else wants to try first, go for it.
Tried to repro using the git example but running into infrastructure issues.  I ran 'bundle install' but when I try to run 'ruby echo-ws' I get

  echo-ws.rb:1:in `require': no such file to load -- bundler/setup (LoadError)

I'm working on getting this resolved (possibly by just updating my whole ubuntu install), but if this issue rings a bell for anyone let me know.
Hi Jason,

This test case was built on ruby 1.9.2 / 1.9.3, you may also find that this issue could be resolved by running `bundle exec ruby echo-ws`.
I believe this could leak to memory leaks.
Whiteboard: [MemShrink]
Note per comment 0 that this only seems to be happening if the web page opens a websocket during onclose.  So it's not something that will happen for every page that uses websockets.   That doesn't mean it doesn't need to be fixed--it does--just FYI for prioritization's sake.
Haven't manage to reproduce yet.

But whatever I saw with IRCCloud was adding hundreds of milliseconds to each CC time.
It was a huge CC graph kept alive by websocket object.
It is possible that the leak wasn't this bug but something else.
OS: Mac OS X → All
Hardware: x86 → All
Version: 7 Branch → Trunk
> But whatever I saw with IRCCloud was adding hundreds of milliseconds to each CC time.
> It was a huge CC graph kept alive by websocket object.

I thought irccloud didn't use websockets in Firefox.  Did they change that recently?
(In reply to Justin Lebar [:jlebar] from comment #13)
> > But whatever I saw with IRCCloud was adding hundreds of milliseconds to each CC time.
> > It was a huge CC graph kept alive by websocket object.
> 
> I thought irccloud didn't use websockets in Firefox.  Did they change that
> recently?

when we unprefixed.
Assignee: nobody → jduell.mcbugs
Whiteboard: [MemShrink] → [MemShrink:P2]
Attached patch WIPSplinter Review
untested patch. compiling and testing whether it affects to irccloud case.
Comment on attachment 592523 [details] [diff] [review]
WIP

https://tbpl.mozilla.org/?tree=Try&rev=ecdb70c3517c

We should perhaps do something in necko level too.
I'm still testing if this helps with irccloud case.
Attachment #592523 - Flags: feedback?(jduell.mcbugs)
Hmm, the patch is not enough.
Jason, can Necko keep WebSocket object alive?
I'll look ASAP into whether a loadgroup bug might be keeping WS channels alive.  It's high on my TODO list.
Please re-nominate for tracking once it's clearer if this bug specifically is causing memory leaks or other user pain.
Attached file IRC chat with bz
bz/smaug,

I've got some idea of what's going here, but I'm still not sure of the right
solution/hack to fix this.

Here's what's going on (which I've figured out with the help of bz on IRC: see
attachment)

1) When we cancel a page because we're navigating away, the cancel happens before the cancelled onstop message gets dispatched, so any websocket created during onstop() doesn't get cancelled and exists as a "ghost" websocket until the server closes it (which can be a long time) or the user does another navigation (see #2).  nsWebSocket has a reference to the inner window that created it, and I assume that's what's blocking cycle collection.

2) There's no way at present to have the loadgroup notice and cancel any such late-arriving channels, because the docshell is already reusing the same loadgroup for the document that's being navigated-to.  So websockets created in onclose() logically become part of the loadgroup of the new document, and don't get canceled until that document gets canceled, for instance when we navigate to yet another page.  I've verified that I see the websocket destroyed at the next navigation, or if the tab is closed.
   
3) This is essentially a design flaw in the way LoadGroups work, and the ideal solution is to re-engineer them so they are specific to a document, not per-docshell (see bz's comment in the IRC chat), so we can do late cancellation.  But that's not a short-term solution in bz's opinion.

4) The most promising solution/hack to try in the meantime is for us to check and destroy the websocket if it detects that its mOwner is no longer the active inner window of the outer window that created it.  (This is what the existing CheckInnerWindowCorrectness() function does).  The problem is that there doesn't seem to be a deterministic point at which we can check this.  nsWebSocket::Init (called during the onclose event) seems to be too early: CheckInnerWindowCorrectness always returns ok for me.  Calling CheckInnerWindowCorrectness during nsWebsocket:OnStart (as Ollie's patch does) sometimes fails, but there appears to be a race, as I'm seeing that sometimes the inner window is still the active one.  We can add more CheckInnerWindowCorrectness calls that fail the websocket in Send() and OnMessageAvailable() (we have some checks already but they don't fail the WS, just skip dispatching events), but we have no guarantee that they will be called, i.e. if the websocket is reopened and neither the client nor the server sends a message we'll keep the WS alive indefinitely.  We could add code to also check window correctness during websocket protocol pings, but we currently have pings disabled.  We could add a one-time timer that goes off at some (probabilistically safe) point after OnStart and calls CheckInnerWindowCorrectness.  But I'm wondering if there's a non-probabilistic way to know when we can do this check.

Am I safe in assuming that it's always ok to fail a websocket if CheckInnerWindowCorrectness() fails?  I assume so.
jst/mrbkap,

bz suggests you may know the best place to hook in a callback that will let a websocket know that its inner window is no longer the active window.

See above (especially #4 of last comment) and this IRC chat:

<jduell> bz: ideally it'd be nice to be able to know the window != outer.inner 
         thing *before* we launch the new websocket network connection.
<bz> jduell: how about just checking in Init() and then adding a listener 
     for when the window stops being the current one?
<jduell> bz: how do I add a listener for that?  
<bz> jduell: simplestthing, I suspect, is to just register our stuff 
     on the window somewhere
<bz> jduell: and then in nsGlobalWindow::PageHidden kill them all off
<bz> jduell: though we still fire unload after that point...
<bz> jduell: one second...
<bz> jduell: perhaps kill websockets in nsGlobalWindow::Freeze?
<bz> jduell: mrbkap or jst might know better what the right place 
             to hook into this stuff is.  :(
Attachment #592523 - Flags: feedback?(jduell.mcbugs)
This patch kills ghost websockets (i.e. created during onclose when page is going away) by observing window FROZEN/DESTROYED notifications (I also shoved in some checks for window correctness in Init/OnStart, but the window generally hasn't changed by that point, so they're not enough).

The major suboptimal "misfeature" about this fix is that ghost websockets still get created and will generally connect to the server.  If the connection happens quickly enough, it may receive messages and take action on them (change the page, etc: I can see this in mochitest--it reports test success again before the page I'm navigating to loads).

I thought of an alternate hack that may work better than (or in addition to) this patch:  Is it true that nsWebSocket::Cancel will only be called by things that will eventually cause checkWindowCorectness to be false (i.e. page navigation, window destroyed/frozen)?  It's never called by WebSocketChannel, so I'm hoping this assumption is true.  If so we could have any websocket that observes Cancel register its inner window with the websocket protocol handler as a "cancelled window", and we could then block any new websockets from being created by that window.  The protocol handler could remove/release the window from its records when it gets notification that the window is frozen/destroyed.  Would that work?  It has the benefit that we'd be able to stop ghost websockets at Init(), before they connect to the server.

I'm not very happy with the console message we display here.  With this patch we show "The connection to %S was interrupted while the page was loading."  Which isn't way off, but also isn't really as good as something like "The Websocket connection to %S was disconnected due to its web page closing" or something like that.  I'm planning to open a new bug to improve WS console messages generally, but we can deal with it now if that's preferable.
Attachment #598476 - Flags: review?(bugs)
Comment on attachment 598476 [details] [diff] [review]
Remove "ghost" websockets via DOM window notifications v1

Patrick, there's only a small /netwerk change here (though you're welcome to read the whole patch if you're into that kind of thing :)

I'm seeing that we often wind up calling WebSocketChannel::Close multiple times (which currently generates a warning).  In part this is because we wind up calling CloseConnection from OnServerClose even if the client initiated the close (I've got a patch to fix that in another bug).  But it can also happen from a Close followed by a Cancel (user shuts page, navigates away).  I could add logic to make sure nsWebSocket only ever calls WebSocketChannel once, but it seems easier to just treat 2nd+ calls as a harmless no-op.
Attachment #598476 - Flags: review?(mcmanus)
Comment on attachment 598476 [details] [diff] [review]
Remove "ghost" websockets via DOM window notifications v1


>+  // Attempt to kill "ghost" websocket: but usually too early for check to fail
>+  rv = CheckInnerWindowCorrectness();
>+  if (NS_FAILED(rv)) {
>+    return NS_ERROR_FAILURE;  // SMAUG: is there a better error to return?
>+  }
Just do rv = CheckInnerWindowCorrectness()
NS_ENSURE_SUCCESS(rv, rv);

>+
>+  // Shut down websocket if window frozen or destroyed (only needed for "ghost"
>+  // websockets--see bug 696085)
if window is ...
Attachment #598476 - Flags: review?(bugs) → review+
Attachment #598476 - Flags: review?(mcmanus) → review+
https://hg.mozilla.org/integration/mozilla-inbound/rev/415feacdd6f5

Ollie--do you think this merits nominating for aurora?  I'd lean towards no, but I don't know how bad the memory leak was (remember it goes away after another page navigation AFAICT).
https://hg.mozilla.org/mozilla-central/rev/415feacdd6f5
Status: UNCONFIRMED → RESOLVED
Closed: 12 years ago
Resolution: --- → FIXED
Target Milestone: --- → mozilla13
I have just spent some time digging into this since we (at pusher.com) are still seeing some issues related to Firefox.

The bad news is that, although the workaround works to some extent (I've been testing in Firefox 13.0.1 on OS X), there is a new race condition which is even worse (the ghost connection lives even after the tab is closed). I have fully documented how to replicate in my updated gist (https://gist.github.com/1308932 - under firefox 13).

I tried changing the timeout in `onclose` to control when the reconnect happens. This is a summary of what happens with different timeout values:

Timeout 0 case: `onclose` is called (with `wasClean = true`) when you click to navigate away from the page, then a new connection is opened (by the `onclose` handler). This connection then appears to be closed by firefox almost instantly (presumably this is the workaround to this bug, so that the ghost connection is not left lying about).

Timeout of 90ms in `onclose`: Exactly the same behaviour as case 0ms above.

Timeout of 100ms in `onclose`: Sometimes behaves as 0ms, sometimes as 103ms.

Timeout of 103ms in `onclose`: Inconsistent behaviour - there is definitely a race condition in firefox here. Usually this works perfectly (i.e. existing connection is closed and new one isn't opened), but sometimes a new connection is established, and it's left around as a ghost connection after exiting the page. This is basically the behaviour I saw when we initially reported this issue, but it's even worse: **The TCP connection actually stays around even when the browser tab is closed, and is only closed when Firefox quits!**

My hunch based on all this experimentation is that if a WebSocket connection is in the process of opening when the page is finally cleaned up by firefox (which seems to be after ~ 105ms), that the connection is orphaned forever. This could potentially leave Firefox amassing large numbers of ghost WebSocket connections.
> I have just spent some time digging into this since we (at pusher.com) are still seeing some issues

This sounds serious, but could you please file a new bug and mark it as blocking this bug?  We try not to recycle existing bugs, especially when they've been closed for months.
Sure, that makes sense. It's here: https://bugzilla.mozilla.org/show_bug.cgi?id=765738
You need to log in before you can comment on or make changes to this bug.

Attachment

General

Created:
Updated:
Size: