Last Comment Bug 712572 - hang forever in bilibili.tv since 20111220 nightly
: hang forever in bilibili.tv since 20111220 nightly
Status: VERIFIED FIXED
[sg:critical][qa!]
: crash, hang, regression
Product: Core
Classification: Components
Component: Networking: WebSockets (show other bugs)
: 11 Branch
: x86 Windows 7
: -- critical (vote)
: ---
Assigned To: Patrick McManus [:mcmanus]
:
Mentors:
http://www.bilibili.tv/
: 712940 713386 (view as bug list)
Depends on:
Blocks:
  Show dependency treegraph
 
Reported: 2011-12-21 00:38 PST by dindog
Modified: 2012-04-10 21:28 PDT (History)
16 users (show)
See Also:
Crash Signature:
QA Whiteboard:
Iteration: ---
Points: ---
Has Regression Range: ---
Has STR: ---
-
wontfix
+
wontfix
+
verified
+
verified
11+
verified
unaffected


Attachments
Add null checks (1.04 KB, patch)
2011-12-23 01:32 PST, Jason Duell [:jduell] (needinfo? me)
no flags Details | Diff | Review
The old js trigger crash (38.78 KB, text/plain)
2012-01-01 10:26 PST, dindog
no flags Details
Reproduce the issue by redirecting (17.53 KB, image/jpeg)
2012-01-01 10:29 PST, dindog
no flags Details
patch v0 (1.36 KB, patch)
2012-01-04 15:19 PST, Patrick McManus [:mcmanus]
jduell.mcbugs: review+
akeybl: approval‑mozilla‑aurora+
akeybl: approval‑mozilla‑beta-
Details | Diff | Review
part 2 v 0 (8.81 KB, patch)
2012-01-05 14:52 PST, Patrick McManus [:mcmanus]
jduell.mcbugs: review+
Details | Diff | Review

Description dindog 2011-12-21 00:38:58 PST
STR:
open http://www.bilibili.tv/
stay a minute, hang, firefox.exe process take more and more memory.

Win 7 sp1 32bit

2011-12-19 nightly was fine.

It's any difference in 12a1&11a1 of 20121220 nightly?
Comment 1 dindog 2011-12-21 00:48:18 PST
Now I confirm this issue first show up in the last 11a1 nightly build
http://hg.mozilla.org/mozilla-central/rev/2afd7ae68e8b
Comment 2 dindog 2011-12-21 01:34:19 PST
in tinderbox build, here are the last work and first wrong build:
last work:
20111219201350
Built from http://hg.mozilla.org/mozilla-central/rev/9eda39f28e5a

first wrong:
20111220004016
Built from http://hg.mozilla.org/mozilla-central/rev/2afd7ae68e8b
Comment 3 dindog 2011-12-21 08:07:52 PST
After adblock plus block the bilibar.js, the issue gone, and I have a quick look at the JS, it does has something about Websocket, which the first go wrong build push-log is patch all about websocket.

Mr Jason Duell, would you please look into this issue?
Comment 4 Alice0775 White 2011-12-21 09:03:58 PST
Confirmed on
http://hg.mozilla.org/mozilla-central/rev/2afd7ae68e8b
Mozilla/5.0 (Windows NT 6.1; WOW64; rv:11.0a1) Gecko/20111220 Firefox/11.0a1 ID:20111220031159
and
http://hg.mozilla.org/mozilla-central/rev/cd921d073b22
Mozilla/5.0 (Windows NT 6.1; WOW64; rv:12.0a1) Gecko/20111221 Firefox/12.0a1 ID:20111221031226

Browser eats memory and CPU power, and sometimes crashes finally.

Regression Pushlog:
http://hg.mozilla.org/mozilla-central/pushloghtml?fromchange=9eda39f28e5a&tochange=2afd7ae68e8b

Reproducible: Always

Steps to Reproduce:
1. Start Firefox with clean profile + Flash 11.1.102.55
2. Open ehttp://www.bilibili.tv/
3. Wait complete loading the page
4. Scroll to bottom
5. Wait for a while. (repeat scroll top/bottom if necessary)

bp-1f16df03-7c62-4669-971d-5ccaf2111221
Comment 5 dindog 2011-12-21 09:29:09 PST
chatRoomClass.prototype.initws = function(){}

There are enormous amount of this function iteration, and before yesterday, Firefox won't ran at all because

("WebSocket"in window) still undefined.


and I'm not sure is the function bad written or Firefox's fault, the iteration count is way more than the script's mean to. ("reconnectCount >30 should have a longer interval)


here is the code trigger this issue:
===================================================================
chatRoomClass.prototype.initws=function(){
	this.JSMode=true;
	self=this;
	if("WebSocket"in window){
		if(this.ws!=null){
			self.chatlog('',0,"[信息] 连接已存在...正在中断重新连接");
			this.ws.close();
		}
		resetConnection=true;alert('ran');
		this.ws=new WebSocket("ws://"+chatRoom.CHAT_SERVER+":88/"+this.currentChatRoom);
		this.status.innerHTML="WS连接中...";
		var self=this;
		this.ws.onopen=function(){clearInterval(reconnection);
		self.status.innerHTML='已连接.';
		resetConnection=false;$("#submit_msg")[0].disabled=false;
		if(self.onconnect!=null)self.onconnect();};
		this.ws.onclose=function(){self.status.innerHTML='已断开.';
		alert(reconnectCount)
			if(!resetConnection)self.chatlog('',0,"[信息] 与服务器连接中断...正在重新连接 重试次数: "+reconnectCount++);
			if(reconnectCount>30){
				reconnection=setInterval(function(){clearInterval(reconnection);self.initws();},30000);
				}
			else{
				reconnection=setInterval(function(){clearInterval(reconnection);self.initws();},1000);
				}
		};
		this.ws.onerror=function(){
			self.status.innerHTML='错误.';self.chatlog('',0,"[信息] 与服务器连接错误!");
			var reconnection=setInterval(function(){clearInterval(reconnection);self.initws();},1000);
			};
		this.ws.onmessage=function(msg){
			msgdata=msg.data.replace(/[\x5C]{2}/g,"\x5C");
			if(msg.data=="REST"){
					self.initws();
			}
			else{
				eval(msgdata);
			}
		};
	}
	else{
		this.chatlog('','SYSTEM','您的浏览器不支持Flash和WebSocket,无法使用本功能');
		}
};
Comment 6 Patrick McManus [:mcmanus] 2011-12-22 07:23:49 PST
*** Bug 712940 has been marked as a duplicate of this bug. ***
Comment 7 Patrick McManus [:mcmanus] 2011-12-22 07:24:23 PST
jason - is this on your radar? Its on aurora.
Comment 8 Jason Duell [:jduell] (needinfo? me) 2011-12-23 01:30:39 PST
On my radar indeed.  

Strange that we're getting a segfault here:

   http://hg.mozilla.org/mozilla-central/annotate/ed47a41ba26a/netwerk/protocol/websocket/WebSocketChannel.cpp#l163

given that CallOnStop is only launched if mListener != null, and there's no code (other than the WS destructor) that nulls out mListener once it's set.  (Actually we never allow mListener to be null once AsyncOpen is called).   And since CallOnStop keeps a references to the channel, we shouldn't be calling the destructor yet (nit: we use nsRefPtr<WebSocketChannel> in CallOnStop: should we be using nsCOMPtr instead?).

All crashes appear to be on Windows (with most but not all using Asian charsets), which makes me suspect malware or LSP issues.  But it's not just old versions of windows (includes 6.1).  

I don't have any clever ideas at the moment.  We can of course land a patch that checks mListener != null before calling the function.  Can't hurt.
Comment 9 Jason Duell [:jduell] (needinfo? me) 2011-12-23 01:32:19 PST
Created attachment 584023 [details] [diff] [review]
Add null checks

If this patch actually causes the crashes to go away we've got other problems, because we shouldn't actually hit null in this case.
Comment 10 Patrick McManus [:mcmanus] 2011-12-23 18:55:52 PST
Comment on attachment 584023 [details] [diff] [review]
Add null checks

So I was able to reproduce with a nightly-12 using the steps in comment 4.

It took it a couple minutes to crash and I watched the process sucking up RAM out of control in the meantime - so some kind of loop no doubt that resulted in weird failure (probably due to a failed alloc or unexpected return path). But it did happen the first time I tried it (including the scroll to top/bottom repeatedly bits - I note this makes a lot of different websocket connections).

Since there is a easy STR I'd rather understand the bug than take the null deref prevention path. Make sense to you too?
Comment 11 Jason Duell [:jduell] (needinfo? me) 2011-12-28 16:10:00 PST
OK, I've made some small amount of progress here.

bilbili.tv's websocket server ("Server: BiLiBiLi.tv ChatServer") is not doing a legal websocket handshake:  it is missing the required Sec-WebSocket-Accept header in it's reply, and so we fail the connection (as per the spec).   Chrome also appears to fail the connection AFAICT.

But...

We don't prevent the site from immediately trying to reconnect again, since we haven't yet introduced a delay to reconnecting (bug 711793).  The site apparently is doing exactly that.  

I'm still not sure of why this is causing so much CPU to be used: one would assume that there'd still be lots of time during the network round-trip when the CPU would be quiet.  I'm seeing around 90% CPU usage, vs ~15% use for Chrome, which also fails the websocket connection but seems to somehow stop any reconnect attempts after the 1st failure.

I'm not seeing any signs of a memory leak on Linux, so that part seems to be Windows-specific.

Will look into this further (help on the windows leak would be nice--I don't have a windows box convenient right now).
Comment 12 dindog 2011-12-28 22:25:28 PST
It would be nice to write a testcase for this, if bilibili.tv fixes it, we can't tell if this issue fixed or not.
Comment 13 dindog 2012-01-01 10:26:06 PST
Created attachment 585215 [details]
The old js trigger crash

The site has changed this part of code in bilibar.js... no crashing now... kind of like Chrome now.

But I think we still need to look into the problem. As Jason said, we haven't find out what cause the high CPU usage and high memory comsumption in Windows yet.
Comment 14 dindog 2012-01-01 10:29:19 PST
Created attachment 585216 [details]
Reproduce the issue by redirecting

I save an original crashing bilibar.js, and still can reproduce this bug by redirecting it to local file. ( using an addon name MASON, adding a rule like as in attachment.)
Comment 15 Jason Duell [:jduell] (needinfo? me) 2012-01-03 18:03:24 PST
dindog@163.com,

Can you attach a copy of the original bilibar.js so I can see if I can reproduce?  Thank you.
Comment 16 Jason Duell [:jduell] (needinfo? me) 2012-01-03 18:03:49 PST
*** Bug 713386 has been marked as a duplicate of this bug. ***
Comment 17 Jason Duell [:jduell] (needinfo? me) 2012-01-03 18:06:14 PST
dindog@163.com

Ah, I see you already attached it!  Thanks :)
Comment 18 Josh Aas 2012-01-03 18:08:01 PST
Patrick - can you take this bug?
Comment 19 Jason Duell [:jduell] (needinfo? me) 2012-01-03 18:10:29 PST
marking sg:critical per bug 713386 comment 0.

Bob: the crashing line is

 163     mChannel->mListener->OnStop(mChannel->mContext, mData);

Do you know from the crash report which variable/member has the bogus address?
Comment 20 Bob Clary [:bc:] 2012-01-04 06:02:28 PST
http://www.bilibili.tv/video/game.html

-		this	0x069cf5a0 {mRefCnt={...} _mOwningThread={...} mChannel={...} ...}	mozilla::net::CallOnStop * const
-		nsIRunnable	{...}	nsIRunnable
-		nsISupports	{...}	nsISupports
-		__vfptr	0x02e7c290 const mozilla::net::CallOnStop::`vftable'	*
		[0]	0x017886e0 mozilla::net::CallOnStop::QueryInterface(const nsID &, void * *)	*
		[1]	0x01788520 mozilla::net::CallOnStop::AddRef(void)	*
		[2]	0x01788590 mozilla::net::CallOnStop::Release(void)	*
-		mRefCnt	{mValue=1 }	nsAutoRefCnt
		mValue	1	unsigned long
-		_mOwningThread	{mThread=0x049f09a8 }	nsAutoOwningThread
		mThread	0x049f09a8	void *
-		mChannel	{mRawPtr=0x05b29fc8 }	nsRefPtr<mozilla::net::WebSocketChannel>
-		mRawPtr	0x05b29fc8 {mRefCnt={...} _mOwningThread={...} mSocketThread={...} ...}	mozilla::net::WebSocketChannel *
+		mozilla::net::BaseWebSocketChannel	{mOriginalURI={...} mURI={...} mListener={...} ...}	mozilla::net::BaseWebSocketChannel
+		nsIHttpUpgradeListener	{...}	nsIHttpUpgradeListener
+		nsIStreamListener	{...}	nsIStreamListener
+		nsIInputStreamCallback	{...}	nsIInputStreamCallback
+		nsIOutputStreamCallback	{...}	nsIOutputStreamCallback
+		nsITimerCallback	{...}	nsITimerCallback
+		nsIDNSListener	{...}	nsIDNSListener
+		nsIInterfaceRequestor	{...}	nsIInterfaceRequestor
+		nsIChannelEventSink	{...}	nsIChannelEventSink
+		mRefCnt	{mValue=3722304989 }	nsAutoRefCnt
+		_mOwningThread	{mThread=0xdddddddd }	nsAutoOwningThread
+		mSocketThread	{mRawPtr=0xdddddddd }	nsCOMPtr<nsIEventTarget>
+		mChannel	{mRawPtr=0xdddddddd }	nsCOMPtr<nsIHttpChannelInternal>
+		mHttpChannel	{mRawPtr=0xdddddddd }	nsCOMPtr<nsIHttpChannel>
+		mDNSRequest	{mRawPtr=0xdddddddd }	nsCOMPtr<nsICancelable>
+		mRedirectCallback	{mRawPtr=0xdddddddd }	nsCOMPtr<nsIAsyncVerifyRedirectCallback>
+		mRandomGenerator	{mRawPtr=0xdddddddd }	nsCOMPtr<nsIRandomGenerator>
+		mHashedSecret	{...}	nsCString
+		mAddress	{...}	nsCString
+		mTransport	{mRawPtr=0xdddddddd }	nsCOMPtr<nsISocketTransport>
+		mSocketIn	{mRawPtr=0xdddddddd }	nsCOMPtr<nsIAsyncInputStream>
+		mSocketOut	{mRawPtr=0xdddddddd }	nsCOMPtr<nsIAsyncOutputStream>
+		mCloseTimer	{mRawPtr=0xdddddddd }	nsCOMPtr<nsITimer>
		mCloseTimeout	3722304989	unsigned int
+		mOpenTimer	{mRawPtr=0xdddddddd }	nsCOMPtr<nsITimer>
		mOpenTimeout	3722304989	unsigned int
+		mPingTimer	{mRawPtr=0xdddddddd }	nsCOMPtr<nsITimer>
		mPingTimeout	3722304989	unsigned int
		mPingResponseTimeout	3722304989	unsigned int
+		mLingeringCloseTimer	{mRawPtr=0xdddddddd }	nsCOMPtr<nsITimer>
		mMaxConcurrentConnections	-572662307	int
		mRecvdHttpOnStartRequest	1	unsigned int
		mRecvdHttpUpgradeTransport	0	unsigned int
		mRequestedClose	1	unsigned int
		mClientClosed	1	unsigned int
		mServerClosed	1	unsigned int
		mStopped	0	unsigned int
		mCalledOnStop	1	unsigned int
		mPingOutstanding	1	unsigned int
		mAllowCompression	1	unsigned int
		mAutoFollowRedirects	0	unsigned int
		mReleaseOnTransmit	1	unsigned int
		mTCPClosed	1	unsigned int
		mOpenBlocked	1	unsigned int
		mOpenRunning	0	unsigned int
		mChannelWasOpened	1	unsigned int
		mMaxMessageSize	-572662307	int
		mStopOnClose	3722304989	unsigned int
		mServerCloseCode	56797	unsigned short
+		mServerCloseReason	{...}	nsCString
		mScriptCloseCode	56797	unsigned short
+		mScriptCloseReason	{...}	nsCString
+		mFramePtr	0xdddddddd <Bad Ptr>	unsigned char *
+		mBuffer	0xdddddddd <Bad Ptr>	unsigned char *
		mFragmentOpcode	221 'Ý'	unsigned char
		mFragmentAccumulator	3722304989	unsigned int
		mBuffered	3722304989	unsigned int
		mBufferSize	3722304989	unsigned int
+		mInflateReader	{mRawPtr=0xdddddddd }	nsCOMPtr<nsIStreamListener>
+		mInflateStream	{mRawPtr=0xdddddddd }	nsCOMPtr<nsIStringInputStream>
+		mCurrentOut	0xdddddddd {mMsg={...} mMsgType=??? mLength=??? }	mozilla::net::OutboundMessage *
		mCurrentOutSent	3722304989	unsigned int
+		mOutgoingMessages	{mSize=-572662307 mCapacity=-572662307 mOrigin=-572662307 ...}	nsDeque
+		mOutgoingPingMessages	{mSize=-572662307 mCapacity=-572662307 mOrigin=-572662307 ...}	nsDeque
+		mOutgoingPongMessages	{mSize=-572662307 mCapacity=-572662307 mOrigin=-572662307 ...}	nsDeque
		mHdrOutToSend	3722304989	unsigned int
+		mHdrOut	0xdddddddd <Bad Ptr>	unsigned char *
+		mOutHeader	0x05b2a190 "ÝÝÝÝÝÝÝÝÝÝÝÝÝÝÝÝÝÝÝÝÝÝÝÝÝÝÝÝÝÝÝÝÝÝÝÝÝÝÝÝÝÝÝÝÝÝÝÝÝÝÝÝÝÝÝÝÝÝÝÝÝÝÝÝÝÝÝÝÝÝÝÝÝÝÝÝÝÝÝÝÝÝÝÝÝÝÝÝÝÝÝÝÝÝÝÝÝÝÝÝÝÝÝÝÝÝÝÝÝÝÝÝÝÝÝÝÝÝÝÝÝÝÝÝÝÝÝÝÝÝÝÝÝÝÝÝÝÝÝÝÝÝÝÝÝÝÝÝÝÝÝÝÝÝÝÝÝÝÝÝÝÝÝÝÝÝÝÝÝÝÝÝÝÝÝÝÝÝÝÝÝÝÝÝÝÝÝÝÝÝÝÝÝ"	unsigned char [1016]
+		mCompressor	0xdddddddd {mActive=??? mZlib={...} mStream={...} ...}	mozilla::net::nsWSCompression *
		mDynamicOutputSize	3722304989	unsigned int
+		mDynamicOutput	0xdddddddd <Bad Ptr>	unsigned char *
		mData	2147549183	unsigned int
Comment 21 Bob Clary [:bc:] 2012-01-04 06:08:03 PST
I didn't expand all of the variables, but one stands out as a source of fun/pwnage:

-		nsIHttpUpgradeListener	{...}	nsIHttpUpgradeListener
-		nsISupports	{...}	nsISupports
+		__vfptr	0xdddddddd	*
Comment 22 Patrick McManus [:mcmanus] 2012-01-04 12:19:42 PST
I need to do some more testing (including waiting for my windows build to finish), but I think the patch below is (at least part of) the solution.

AbortSession() can run on either the main thread or the socket thread. If it does run on the socket thread then the event queue can run synchronously on the stack which means the fin message is dispatched without the stoponclose reason being set - which leads to all kinds of bad things when we're already stopped. (basically the state is inconsistent).

all the cpu on this basically unrelated to websockets - spinning gifs, flash, etc..

@@ -1679,20 +1680,20 @@ WebSocketChannel::AbortSession(nsresult 
 
   if (mStopped)
     return;
   mStopped = 1;
 
   if (mTransport && reason != NS_BASE_STREAM_CLOSED &&
       !mRequestedClose && !mClientClosed && !mServerClosed) {
     mRequestedClose = 1;
+    mStopOnClose = reason;
     mSocketThread->Dispatch(
       new OutboundEnqueuer(this, new OutboundMessage(kMsgTypeFin, nsnull)),
                            nsIEventTarget::DISPATCH_NORMAL);
-    mStopOnClose = reason;
   } else {
     StopSession(reason);
   }
 }
Comment 23 Patrick McManus [:mcmanus] 2012-01-04 15:19:53 PST
Created attachment 585917 [details] [diff] [review]
patch v0

dindog@163.com can you verify using a build from

https://ftp.mozilla.org/pub/mozilla.org/firefox/try-builds/mcmanus@ducksong.com-7f69fdb3050b/

?
Comment 24 dindog 2012-01-04 22:58:27 PST
(In reply to Patrick McManus [:mcmanus] from comment #23)
> Created attachment 585917 [details] [diff] [review]
> patch v0
> 
> dindog@163.com can you verify using a build from
> 
> https://ftp.mozilla.org/pub/mozilla.org/firefox/try-builds/mcmanus@ducksong.
> com-7f69fdb3050b/
> 
> ?
AFAICT, it doesn't help. Still hang Firefox and memory usage bump up, crash after few minutes.
Comment 25 Patrick McManus [:mcmanus] 2012-01-05 06:49:32 PST
(In reply to dindog from comment #24)

> > ?
> AFAICT, it doesn't help. Still hang Firefox and memory usage bump up, crash
> after few minutes.

drat - definitely fixed a crash I could repro with the same signature and yesterday I couldn't repro the exploding ram thing with the change; but today I can :(

I have some trouble with mason staying persistent (I keep having to reconfigure it), so maybe I tested the wrong thing or maybe its ephemeral.
Comment 26 dindog 2012-01-05 07:30:02 PST
(In reply to Patrick McManus [:mcmanus] from comment #25)
> (In reply to dindog from comment #24)
> 
> > > ?
> > AFAICT, it doesn't help. Still hang Firefox and memory usage bump up, crash
> > after few minutes.
> 
> drat - definitely fixed a crash I could repro with the same signature and
> yesterday I couldn't repro the exploding ram thing with the change; but
> today I can :(
> 
> I have some trouble with mason staying persistent (I keep having to
> reconfigure it), so maybe I tested the wrong thing or maybe its ephemeral.

That's strange. Mason being very stable and useful for me all the time.
To check if Mason work, you can open the bilibar.js in browser by its original URL, if mason does its job, the script will redirect, and you can see the URL is file:///
Comment 27 Patrick McManus [:mcmanus] 2012-01-05 09:21:03 PST
here's my theory:

from the nspr logging it appears that the js gets into a very quick loop sequence of new/open/close without waiting for the open error event. (there is some code in the js that just checks the state of the object looking for OPENED). 

This results in a bunch of internal asynchronous stuff building up very long queues because while the channel can ackowledge the close() event it can't really clean up the channel cpp object until the internal events are cleaned up. The logs show that happening in bigger and bigger bursts as time goes on and it can't keep up because opens are happening so freaking fast. That's basically livelock.

there is code in there to protect the socket pool from being depleted - no more than 200 simultaneous sockets. Makes a lot of sense to me to change that semantic to be no more than 200 simultaneous websocket objects (which can be checked at open time, before anything goes async) because after all a websocket object is no good without a socket anyhow - this way it will fail early.
Comment 28 Patrick McManus [:mcmanus] 2012-01-05 14:52:45 PST
Created attachment 586231 [details] [diff] [review]
part 2 v 0

Count all channels towards max session limit, not just connected ones. As per comment 27
Comment 29 Patrick McManus [:mcmanus] 2012-01-05 15:03:32 PST
Adding Boris, Jonas, and Olli as my experts on the JS and Content interaction.

This bug is a bit of a mess (and so is the website in question!).

There are two patches attached already - the first one fixes a legit crash bug.

The JS in question tries to use websockets but the handshake fails legitimately beacause their server is broken. Their error handling logic is extremely convoluted but it clearly results in more and more requests being made. (probably exponentially growing where 1 failure begats > 1 retries which also fail).

The second patch seriously reduces our resource usage when under this flood of websocket open requests by failing early when the parallel limit is reached. It helps a lot.

But the experience still sucks - the browser basically hangs (maybe updating once or twice in a couple minute period), and you never get the runaway script dialog. Memory also grows and it will eventually oom.

I've reduced it to a simple illustration - though you probably won't get too far without applying the patches above at http://www.ducksong.com/mozilla/712572/simple.js - so you don't get bogged down in the bilibili.tv crud.

I don't think the script technically leaks.

my question is should this run any better than this? (be more interruptable?)
Comment 30 Jason Duell [:jduell] (needinfo? me) 2012-01-05 18:08:14 PST
Comment on attachment 586231 [details] [diff] [review]
part 2 v 0

Review of attachment 586231 [details] [diff] [review]:
-----------------------------------------------------------------

Looks good.  My one concern was to make sure that the 200 Websocket-limit wasn't impeded by anything holding onto websockets for an extraordinary amount of time (like the way HTTP channels are kept around for ages by docshells that are kept around to support back/forward history navigation).  But it looks like we'll free the WebSocketChannel even if something keeps the nsWebSocket around for a long time (which I'm guessing nothing does anyway).
Comment 31 Boris Zbarsky [:bz] 2012-01-05 20:54:49 PST
The script linked in comment 29 basically comes down to this:

  function f() {
    for (var i = 0; i < 32; ++i) setTimeout(f);
  }
  setTimeout(f);

right?  It's using setInterval and clearing them, so setTimeout is equivalent....
Comment 32 Boris Zbarsky [:bz] 2012-01-05 20:56:00 PST
Oh, except it's only clearing one of the intervals, so it's more Fibonacci than just exponential.  Same deal, though as far as I can see.  Is there something more to it than that?
Comment 33 Jonas Sicking (:sicking) 2012-01-05 21:01:20 PST
The WebSocket spec says that in order to exhaust server resources we should only ever attempt to open one websocket connection at a time to the server. I.e. if someone does:

ws1 = new WebSocket(url);
ws2 = new WebSocket(url);

we should still let the second line execute successfully, but not do any network-wise for it until the first connection has either succeeded or failed. Thus we would of course not fire any success or error events on the second connection until the first one has succeeded or failed.

If we did this, we wouldn't just save server resources but also client resources. The growth should be significantly slower. Memory consumption for the connection objects should grow linearly in time, and network resources should remain constant.

If this is correct then I'd think that this is enough of a fix. There isn't a whole lot we can do about the memory growth as I see it. The best thing we could do is to log failed connections to the developer console and hope that the developer notices this.

One thing that we *could* do is to somehow log the fact that there are lots of outstanding requests preventing network activity from happening even though new WebSocket connection objects being instantiated. That seems like it might help *this* developer, but I'm not convinced that this'll be a common enough problem that it's worth spending time on.
Comment 34 Patrick McManus [:mcmanus] 2012-01-06 05:27:12 PST
(In reply to Boris Zbarsky (:bz) from comment #31)
> The script linked in comment 29 basically comes down to this:
> 
>   function f() {
>     for (var i = 0; i < 32; ++i) setTimeout(f);
>   }
>   setTimeout(f);
> 
> right?  It's using setInterval and clearing them, so setTimeout is
> equivalent....

as far as I can tell, yes - same stuff. And the result of this is to almost completely freeze the browser and no "kill this script" dialog pops up.

I can see why that would be :) - my question is basically is this a known quantity and ok from the perspective of the status quo.

If so I think we can complete this bug with the two patches above which eliminate a crash and keep the memory usage down to a small dribble.
Comment 35 Patrick McManus [:mcmanus] 2012-01-06 05:34:00 PST
(In reply to Jonas Sicking (:sicking) from comment #33)
> The WebSocket spec says that in order to exhaust server resources we should
> only ever attempt to open one websocket connection at a time to the server.
> I.e. if someone does:
> 
> ws1 = new WebSocket(url);
> ws2 = new WebSocket(url);
> 
> we should still let the second line execute successfully, but not do any
> network-wise for it until the first connection has either succeeded or
> failed. Thus we would of course not fire any success or error events on the
> second connection until the first one has succeeded or failed.

we do that. One caveat - the spec says that the url must be dns resolved before deciding to queue or process.

and with patch 2 above we actually throw an exception when there are hundreds of websockets outstanding - which would flood the socket pool and make http unusable.

That change has made the original 'use-case' attached to this bug pretty stable, but the browser has no interactivity. I just wanted to make sure that is more or less expected the way it was crafted (which I think we can clearly say was a development bug and they've changed the code on the website now). bz might be indicating in comment 31 that this is basically expected for that case.

(his simpler case that doesn't use ws or xhr at all exhibits the same behavior)

> 
> If we did this, we wouldn't just save server resources but also client
> resources. The growth should be significantly slower. Memory consumption for
> the connection objects should grow linearly in time, and network resources
> should remain constant.
> 
> If this is correct then I'd think that this is enough of a fix. There isn't
> a whole lot we can do about the memory growth as I see it. The best thing we
> could do is to log failed connections to the developer console and hope that
> the developer notices this.
> 
> One thing that we *could* do is to somehow log the fact that there are lots
> of outstanding requests preventing network activity from happening even
> though new WebSocket connection objects being instantiated. That seems like
> it might help *this* developer, but I'm not convinced that this'll be a
> common enough problem that it's worth spending time on.
Comment 36 Boris Zbarsky [:bz] 2012-01-06 07:30:28 PST
> my question is basically is this a known quantity

I believe it is, though I can't find a bug tracking it.  Just file one, and don't worry about it here, please?
Comment 37 Jonas Sicking (:sicking) 2012-01-06 08:35:59 PST
Why does it have the DNS requirement? Do we need to follow that?
Comment 38 Patrick McManus [:mcmanus] 2012-01-06 08:37:14 PST
(In reply to Boris Zbarsky (:bz) from comment #36)
  Just file one, and
> don't worry about it here, please?

bug 715918
Comment 39 Patrick McManus [:mcmanus] 2012-01-06 08:41:20 PST
(In reply to Jonas Sicking (:sicking) from comment #37)
> Why does it have the DNS requirement? Do we need to follow that?

the no-parallel-handshake to same host rule is designed to protect non-websocket servers from dos.

allowing alternate hostnames in urls to bypass that restriction makes it trivial for an attacker to bypass it because the alternate hostnames can be put in a domain other than that owned by the server.
Comment 40 Jonas Sicking (:sicking) 2012-01-06 10:13:46 PST
Ah, so the idea is to not open multiple connections to the same IP, rather than the same DNS?

I.e. we should potentially stall even longer than one connection per hostname?
Comment 41 Patrick McManus [:mcmanus] 2012-01-06 10:48:44 PST
(In reply to Jonas Sicking (:sicking) from comment #40)
> Ah, so the idea is to not open multiple connections to the same IP, rather
> than the same DNS?
> 

yep. It kinda breaks down with proxies and LB and whatnot, but that's the idea.
Comment 44 Patrick McManus [:mcmanus] 2012-01-08 08:45:22 PST
Comment on attachment 585917 [details] [diff] [review]
patch v0

[Approval Request Comment]
Regression caused by (bug #): 

not a regression - previously unknown crash that might have a security implication

User impact if declined: 

crash with unknown security implciation triggered by malicious websockets code and some timing luckiness (no known exploit, only a security issue in so far as any corruption crash might be one)

Testing completed (on m-c, etc.): 

m-c

Risk to taking this patch (and alternatives if risky): low
Comment 45 dindog 2012-01-08 22:23:46 PST
Why this mark as fixed? It looks everything the same as first reported to me
Comment 46 Patrick McManus [:mcmanus] 2012-01-09 06:07:26 PST
(In reply to dindog from comment #45)
> Why this mark as fixed? It looks everything the same as first reported to me

Hi,

there were a couple issues conflated in this bug.. one is the crash on exit or quick-crash that bob clary was seeing and that some of the crashes showed. That should be fixed by patch one. I don't believe you reported that.

The other issue is the pathalogical behavior of exponential growth around settimer/setinterval that this bug illustrates. patch 2 makes the memory use with that better for websockets but it doesn't address the underlying problem which is bigger than websockets.

That issue is being tracked under 715918 and is still open.

this bug is marked fixed because it is marked as a security sensitive bug and the crash associated with that (not the out of memory one) is believed fixed with patch 1. Let me know if you're still seeing that. Non security issues should be pursued in an unrestricted bug such as 715918.

And its worth noting that there isn't a regression in any meaningful sense here. The site's crazy javascript just didn't kick in with the old prefixed websockets.
Comment 47 dindog 2012-01-09 08:31:12 PST
Thanks for you patient explanation... though I still not get it clear.

Does it mean the memory & CPU issue is partially cause by WebSocket, and other part need to deal with other bugs specific on that?
Comment 48 Alex Keybl [:akeybl] 2012-01-10 15:02:10 PST
Comment on attachment 585917 [details] [diff] [review]
patch v0

[Triage Comment]
Given the fact that this is unknown outside Mozilla and we're now very late in FF10, denying for beta. Given the low-risk nature and bake time, approving for aurora.
Comment 49 Patrick McManus [:mcmanus] 2012-01-10 17:59:05 PST
(In reply to Alex Keybl [:akeybl] from comment #48)
> Comment on attachment 585917 [details] [diff] [review]
> patch v0
> 
> [Triage Comment]
> Given the fact that this is unknown outside Mozilla and we're now very late
> in FF10, denying for beta. Given the low-risk nature and bake time,
> approving for aurora.

https://hg.mozilla.org/releases/mozilla-aurora/rev/10705251c038
Comment 50 Patrick McManus [:mcmanus] 2012-01-10 18:13:52 PST
(In reply to dindog from comment #47)
> Thanks for you patient explanation... though I still not get it clear.
> 
> Does it mean the memory & CPU issue is partially cause by WebSocket, and
> other part need to deal with other bugs specific on that?

hi dindog, thanks for filing the bug and caring about it.

Basically - the only real websockets problem is comment 20. That's also the security related crash (and this is a security restricted bug). That's fixed by patch 1 I believe.

Most of what you observe (the responsiveness freeze and the memory growth) are due to bug 715918 - which remains open. That's not websockets specific, its a JS/DOM interaction thing. Unprefixing websockets resulted in different JS being run which is why the bug correlates with that along with the fact that websockets objects are what are being created - but as that other bug shows you can accomplish the same thing without WS.

Its important to pursue those other issues in the other bug because that one is not security restricted.

hope this helps.
Comment 51 Lukas Blakk [:lsblakk] use ?needinfo 2012-02-23 16:46:23 PST
[Triage comment]

This bug is being tracked for landing on ESR branch.  Please land patches on http://hg.mozilla.org/releases/mozilla-esr10/by Thursday March 1, 2012 in order to be ready for go-to-build on Friday March 2, 2012.

See https://wiki.mozilla.org/Release_Management/ESR_Landing_Process for more information.
Comment 52 Jason Duell [:jduell] (needinfo? me) 2012-02-24 12:40:08 PST
I'm working on esr landing.
Comment 53 Jason Duell [:jduell] (needinfo? me) 2012-02-27 10:35:06 PST
Landed on esr:

  https://hg.mozilla.org/releases/mozilla-esr10/rev/333283a853f0
Comment 54 Patrick McManus [:mcmanus] 2012-02-27 13:24:57 PST
(In reply to Jason Duell (:jduell) from comment #53)
> Landed on esr:
> 
>   https://hg.mozilla.org/releases/mozilla-esr10/rev/333283a853f0

thanks jason!
Comment 55 Tracy Walker [:tracy] 2012-03-05 13:55:20 PST
verified on latest builds of 10.0.3esr, 11, 12 and 13 on Win 7

Note You need to log in before you can comment on or make changes to this bug.