Closed Bug 1227136 Opened 8 years ago Closed 8 years ago

crash in mozilla::net::WebSocketChannel::StartWebsocketData

Categories

(Core :: Networking: WebSockets, defect)

42 Branch
defect
Not set
critical

Tracking

()

RESOLVED FIXED
mozilla48
Tracking Status
firefox42 --- affected
firefox43 --- affected
firefox44 --- affected
firefox46 --- fixed
firefox47 --- fixed
firefox48 --- fixed

People

(Reporter: philipp, Assigned: michal)

References

Details

(Keywords: crash, reproducible, topcrash, Whiteboard: [necko-active])

Crash Data

Attachments

(3 files, 2 obsolete files)

This bug was filed from the Socorro interface and is 
report bp-2bbb217c-d797-436a-b3d0-aa1822151118.
=============================================================
Crashing Thread
Frame 	Module 	Signature 	Source
0 	xul.dll 	mozilla::net::WebSocketChannel::StartWebsocketData() 	netwerk/protocol/websocket/WebSocketChannel.cpp
1 	xul.dll 	mozilla::net::WebSocketChannel::OnTransportAvailable(nsISocketTransport*, nsIAsyncInputStream*, nsIAsyncOutputStream*) 	netwerk/protocol/websocket/WebSocketChannel.cpp
2 	xul.dll 	mozilla::net::CallOnTransportAvailable::Run() 	netwerk/protocol/websocket/WebSocketChannel.cpp
3 	xul.dll 	nsThread::ProcessNextEvent(bool, bool*) 	xpcom/threads/nsThread.cpp
4 	xul.dll 	mozilla::ipc::MessagePump::Run(base::MessagePump::Delegate*) 	ipc/glue/MessagePump.cpp
5 	xul.dll 	MessageLoop::RunHandler() 	ipc/chromium/src/base/message_loop.cc
6 	xul.dll 	nsThreadManager::UnregisterCurrentThread(nsThread*) 	xpcom/threads/nsThreadManager.cpp
7 	xul.dll 	nsBaseAppShell::Run() 	widget/nsBaseAppShell.cpp
8 	xul.dll 	nsAppStartup::Run() 	toolkit/components/startup/nsAppStartup.cpp
9 	xul.dll 	XREMain::XRE_mainRun() 	toolkit/xre/nsAppRunner.cpp
10 	xul.dll 	XREMain::XRE_main(int, char** const, nsXREAppData const*) 	toolkit/xre/nsAppRunner.cpp
11 	xul.dll 	XRE_main 	toolkit/xre/nsAppRunner.cpp

this is a crash-platform crash, but predominantly occurring on android, where it's at #30 with 0.34% of all crashes.

the signature seems to have taken off at the end of october:
https://crash-stats.mozilla.com/signature/?date=%3E2015-10-20&signature=mozilla%3A%3Anet%3A%3AWebSocketChannel%3A%3AStartWebsocketData#graphs

some of the comments seem to indicate that it is happening repeatedly & maybe reproducibly on certain sites - domains that are repeatedly mentioned are wunderground.com and ustream.tv
seems pretty easy for me to reproduce just trying to bring up latest radar view of SFO on weatherunderground.  using 

doesn't seem like it would be too hard for this to spike if we had a turn in usage to hit that kind of usage on a popular site due to a change in browsing behavior like a storms moving in the path of a bunch of fennec users.
Keywords: reproducible
didn't finish my "using"  comment.  using 44b4.  when I ask for an update to something later like 44b8 or 9 that should be available with a check for update, it just says I'm up-to-date, but thats another problem.
this shows up in lots of places on desktop and fennec and might have been around for awhile, but its certainly more likely to be seen on fennec, and might be getting worse due to ticking some new code lately, or just changing in browsing patterns like mentioned in comment 1.

Product 	Version 	Percentage 	Number Of Crashes
FennecAndroid 	43.0 	66.97% 	2770
Firefox 	43.0.4 	13.97% 	578
FennecAndroid 	41.0.2 	3.96% 	164
FennecAndroid 	42.0.2 	1.81% 	75
Firefox 	44.0b8 	1.79% 	74
Firefox 	44.0b7 	1.64% 	68
FennecAndroid 	42.0.1 	1.26% 	52
FennecAndroid 	42.0 	1.04% 	43
FennecAndroid 	41.0 	0.90% 	37
FennecAndroid 	44.0b7 	0.90% 	37
FennecAndroid 	44.0b8 	0.77% 	32
comment zero says ramp up on crash frequency started around end of oct and that corresponds to the release of 41

top line on the stacks shows this code changing back in May as a fix for Bug 1152334. That doesn't quite match up with the change that rode the trains as part of 40, but maybe there was some other intervening bug or delayed crash volume ramp up.

author	Michal Novotny <michal.novotny@gmail.com>
	Mon, 11 May 2015 08:12:19 +0200
changeset 275690 	f39f5779392d
parent 275689 	876a0813cdc5
child 275691 	0d036289fac1
push id	897
push user	jlund@mozilla.com
push date	2015-09-14 18:56 +0000
Treeherder	results
reviewers	bagder
bugs	1152334
milestone	40.0a1
Bug 1152334 - StartWebsocketData null deref mSocketIn, r=bagder 

http://hg.mozilla.org/releases/mozilla-release/annotate/e2f9a0ed50bc/netwerk/protocol/websocket/WebSocketChannel.cpp#l2738

http://hg.mozilla.org/releases/mozilla-release/rev/f39f5779392d
er, I guess push date	2015-09-14 18:56 +0000 would have tied this change closer to the release of 41 since that date is on the other side of the merge date
in and out of the topcrash lists on fennec, probably due to weather and people checking weather underground weather.  several comments about people switching to mobile chrome which doesn't have this crash so this ought to get some priority.  

Other comments talk about it might be related to the dialog that asks to share location data.  I saw that too, and said "ok to share."
Keywords: topcrash
Flags: needinfo?(michal.novotny)
Bug 1152334 tried to fix this crash, but apparently the patch there didn't help. But I'm pretty sure that the patch couldn't make things worse. I was not able to reproduce the crash, I'll see if I can reproduce it now.
Flags: needinfo?(michal.novotny)
The site removed/disabled whatever code was problematic. We would need to reach out to wunderground and get a staging site or something similar.
yeah, I see dramtic simplification of the wunderground content now.  I wonder if it will stay this way or its temporary.

general site contact info at http://www.wunderground.com/about/contact.asp

:blassey
do we have evangelism folks with better direct contacts at wunderground? that could give us some sample code of what they've been running the last few weeks and now have removed?

still other sites that are in the crash data comments, we need to keep looking for repro on those too.

:kbrosnan,  I've lost my access to URL data.  Can you gather some out of socorro, sanitize, and share in this bug?  I can help test.

from recent comments:
keeps crashing. Messenger from Facebook service called messenger 
I keep losing my messages on Facebook behind this crash please fix it a s a p 

PBS NewHour live stream through Ustream caused the crash. 
crashing attempting to connect to ustream.tv - this is getting ridiculous as it's been happening through several of your builds - goodbye firefox - hello chrome 
and other general ustream related comments

:michal and :mcmanus
anything interesting looking at that code at the top of stack that might have changed about the same time this crashes began climbing beyond the background noise they were before that?  

Seems this is an old signature, but the code changed *and* sites seem to have tickled the crashes in unexpected ways.  That makes this one doubly hard to analyze, but we can not stop pursuing any avenue available to us if we want to stamp out a possible rise in this again.
Sent URLs to Chris. Mostly wunderground, some other live streaming media sites such as ustream were on the list.
bob, does this signature ever show up much in your bughunter automation?

looking broader at the urls there are a variety of streaming media sites that seem to tickle the crash as kbrosnan mentioned.

do we have any test cases or automation that specifically grinds on a streaming media set up on both desktop and mobile?

what does the operation of opening and closing sockets look like when the expected behavior is that the site just keeps wanting to stream video content until the streaming session ends, but site reliability, or the network, or other things going on in the client might be unreliable and create delays, disconnects, time outs, and race conditions. 

Something like that might help to explain why we see this much more on mobile where the network is more prone to loading delays/reliabilty, and we see high volatility in the crash volume. 

With the growth of streaming seems likely that this kind of bug *will* return if this thesis is correct, and the thesis might help us to build a more general testing scenario for a growing part of the web and reproducing this bug.
No, we haven't seen this particular issue nor crashes on ustream.tv. We have been testing last years urls but should be going live soon.
So the problem is that WebSocketChannel::Close() is called by nsIWebSocketListener from OnStart() method, it enqueues an outgoing fin message and the message is processed and the connection cleaned up on the background thread sooner than we call mSocketIn->AsyncWait() in WebSocketChannel::StartWebsocketData(). I'll have a patch soon.
Attached patch fix (obsolete) — Splinter Review
Listener is notified after we call AsyncWait() in WebSocketChannel::StartWebsocketData().
Assignee: nobody → michal.novotny
Status: NEW → ASSIGNED
Attachment #8711735 - Flags: review?(daniel)
Comment on attachment 8711735 [details] [diff] [review]
fix

Review of attachment 8711735 [details] [diff] [review]:
-----------------------------------------------------------------

simple and clear!
Attachment #8711735 - Flags: review?(daniel) → review+
Looks like the change makes websockets/unload-a-document/001.htm to fail :-/

https://treeherder.mozilla.org/logviewer.html#?job_id=15902446&repo=try
I'm trying to run the test locally to find out what's wrong but the test isn't run when I run all web-platform tests with "./mach web-platform-tests". Doing "./mach web-platform-tests websockets/unload-a-document/001.htm" ends immediately saying "Ran 0 tests". Ted, how can I run this particular test?
Flags: needinfo?(ted)
Sorry, I don't know that much about wpt, you should ask jgraham.
Flags: needinfo?(ted) → needinfo?(james)
./mach web-platform-tests websockets/unload-a-document/001.html (notice the extra trailing 'l')

or 

./mach web-platform-tests testing/web-platform/tests/websockets/unload-a-document/001.html

if you want command line completion.
Flags: needinfo?(james)
Attached file console log
Sorry for the typo, but with correct test name it doesn't work either. Can you read something from the attached console log? Btw, why the test isn't run when I run all web-platform tests?
Flags: needinfo?(james)
FYI, it complains that something could be listening on the selected port, but I checked ports 2828-2832 and there is nothing listening on these ports, so there must be some other issue.
So it seems that the initialization times out. Increasing init_timeout in testing/web-platform/harness/wptrunner/browsers/base.py fixes the problem.
Flags: needinfo?(james)
The test fails because ws.onclose is called. IIUC the test assumes that onclose shoudn't be called because the windows is navigated away in onopen. But there is obviously some delay between calling controller.navigate() and destroying the window. This patch changes the timing of the events so that onopen is sent a bit later, so there is now a shorter delay between onopen and onclose. I don't know whether this is a problem in mozilla::dom::WebSocketImpl or in window.location, but IMO we call nsIWebSocketListener::onStart and nsIWebSocketListener::onStop() correctly in WebSocketChannel.
OK, let me see if I understand.  The test is failing without taking the crash codepath (i.e. the websocket server connection happens fine).  So the test failure is purely due to your moving the code around in StartWebsocketData() so that mListener->OnStart() now gets called *after* socket->AsyncWait() instead of before it.  Is that right?

If so I'm having trouble figuring out how this could change the timings of onStart/onStop significantly, since no matter what we'll call mListener->OnStart no more than a few nanoseconds later than we used to (unless I'm really missing something).

I'm going to ask :baku if he can look into this.  Andrea, this is a fix for a topcrasher, hence some urgency.  Do you know how the Websocket.cpp code generally blocks "onStop" events from happening when the window/tab for the websocket is closed?  For some reason the small code reordering in this patch seems to cause onclose to get dispatched when the test doesn't expect it (it closes the websocket's windows during OnStart, IIUC).
Flags: needinfo?(amarchesini)
(In reply to Jason Duell [:jduell] (needinfo? me) from comment #26)
> OK, let me see if I understand.  The test is failing without taking the
> crash codepath (i.e. the websocket server connection happens fine).  So the
> test failure is purely due to your moving the code around in
> StartWebsocketData() so that mListener->OnStart() now gets called *after*
> socket->AsyncWait() instead of before it.  Is that right?

Right.

> If so I'm having trouble figuring out how this could change the timings of
> onStart/onStop significantly, since no matter what we'll call
> mListener->OnStart no more than a few nanoseconds later than we used to
> (unless I'm really missing something).

OnStart might be called just few nanoseconds later, but since we start all the websocket machinery before OnStart the delay between OnStart/OnStop is shorter by time spent in OnStart (which might take a while).
Sorry for the delay. I want to apply the patch and see why we have this issue.
Does it happen 100% of the times?
Flags: needinfo?(amarchesini)
On my computer it fails in about 80% cases.
We're still hoping :baku can take a look at this.
Flags: needinfo?(amarchesini)
First of all, sorry for the delay.

The issue here is that CheckInnerWindowCorrectness() sometimes returns NS_OK and we dispatch the 'close' event. Digging into this issue a bit more the workflow is this:

1. nsDocShell::Stop() is executed. We go to nsLoadGroup::Cancel() and from there we go to mozilla::dom::WebSocketImpl::Cancel().

2. We close the channel in the WebSocketImpl::CancelInternal() and then we receive OnStop().

3. From here we call WebSocketImpl::ScheduleConnectionCloseEvents().

The best way we can fix this issue is to change this:

https://dxr.mozilla.org/mozilla-central/source/dom/base/WebSocket.cpp#801
  // called by network stack, not JS, so can dispatch JS events synchronously                                    
  return ScheduleConnectionCloseEvents(aContext, aStatusCode, true);           

and always dispatch such close event asynchronously. I cannot remember if this breaks other things, but it should not.
Flags: needinfo?(amarchesini)
Whiteboard: [necko-active]
Attached patch patch v2 (obsolete) — Splinter Review
Please review the change in dom/base/WebSocket.cpp

https://treeherder.mozilla.org/#/jobs?repo=try&revision=c9b11c27f149
Attachment #8711735 - Attachment is obsolete: true
Attachment #8727945 - Flags: review?(amarchesini)
Comment on attachment 8727945 [details] [diff] [review]
patch v2

Review of attachment 8727945 [details] [diff] [review]:
-----------------------------------------------------------------

::: dom/base/WebSocket.cpp
@@ +797,5 @@
>    MOZ_ASSERT(mWebSocket->ReadyState() != WebSocket::CLOSED,
>               "Shouldn't already be CLOSED when OnStop called");
>  
> +  // dispatch the event asynchronously (see bug #1227136 comment 31)
> +  return ScheduleConnectionCloseEvents(aContext, aStatusCode, false);

We don't call ScheduleConnectionCloseEvents with sync=true anywhere else. Just remove the boolean from the method.
Attachment #8727945 - Flags: review?(amarchesini) → review-
Attached patch patch v3Splinter Review
Attachment #8727945 - Attachment is obsolete: true
Attachment #8728422 - Flags: review?(amarchesini)
Attachment #8728422 - Flags: review?(amarchesini) → review+
Comment on attachment 8728422 [details] [diff] [review]
patch v3

Approval Request Comment
[Feature/regressing bug #]: unknown
[User impact if declined]: firefox can crash when websocket channel is closed from nsIWebSocketListener::onStart
[Describe test coverage new/current, TreeHerder]: tested manually on ustream.tv site
[Risks and why]: low - This patch changes timing of posted events which pointed out another problem that was also fixed in this patch. I can't guarantee that another intermittent failure can't show up but even then it's better to fail some websocket channel than to crash.
[String/UUID change made/needed]: none
Attachment #8728422 - Flags: approval-mozilla-beta?
Attachment #8728422 - Flags: approval-mozilla-aurora?
> [User impact if declined]: firefox can crash when websocket channel is closed from 
> nsIWebSocketListener::onStart
>[Describe test coverage new/current, TreeHerder]: tested manually on ustream.tv site

Is the socket closed as the result of something that ustream.tv, and maybe other sites, intentionally do; or what kind of unintentional actions or time outs, or other things might lead to the socket closing.

Is it possible to construct a test case that sets up the events that reliably would lead to the crash, and now would not after the patch is applied?

What is the user impact of  "fail[ing] some websocket channel" - in the ustream.tv case would that mean video streaming stops?  if that is the case what actions might be required by the user or other to restart the streaming?  page reload?
(In reply to chris hofmann from comment #37)
> > [User impact if declined]: firefox can crash when websocket channel is closed from 
> > nsIWebSocketListener::onStart
> >[Describe test coverage new/current, TreeHerder]: tested manually on ustream.tv site
> 
> Is the socket closed as the result of something that ustream.tv, and maybe
> other sites, intentionally do; or what kind of unintentional actions or time
> outs, or other things might lead to the socket closing.

The channel was closed in onStart, i.e. just immediately after the connection was negotiated. I'm not sure what was the purpose of doing this, maybe connectivity testing?
 
> Is it possible to construct a test case that sets up the events that
> reliably would lead to the crash, and now would not after the patch is
> applied?

It should be possible to create such test, but I'm not sure if the crash can be reproduced consistently. I'll give it a try.

> What is the user impact of  "fail[ing] some websocket channel" - in the
> ustream.tv case would that mean video streaming stops?  if that is the case
> what actions might be required by the user or other to restart the
> streaming?  page reload?

It's hard to tell what could go wrong. In case of the problem found and described in comment 25, a callback that shouldn't be called was called after navigating page away. In this particular example I don't think that it could cause a problem on any page using websocket.
https://hg.mozilla.org/mozilla-central/rev/da2e0df696eb
Status: ASSIGNED → RESOLVED
Closed: 8 years ago
Resolution: --- → FIXED
Target Milestone: --- → mozilla48
Michal, I looked at the crash signature and noticed that there are no reports of this crash on a build greater than 2016-03-03 on Fx45 (Build Id: 20160303134406). This crash does not show up in 46 and 47. Could this have been fixed by another patch somewhere along the way? Does it still make sense to uplift this to Beta and Aurora? I doubt it but wanted to check with you.
Flags: needinfo?(michal.novotny)
I'm not aware of any change that would fix this issue. Maybe something has changes on ustream.tv site. It already happened in the past in bug 1152334 which was IMO dupe of this bug. Although the fix in bug 1152334 didn't fix the problem for some reason there were no new crash reports until now.
Flags: needinfo?(michal.novotny)
Btw, I still have a plan to try to create some test for the crash.
(In reply to Michal Novotny (:michal) from comment #42)
> Btw, I still have a plan to try to create some test for the crash.

Thanks Michal. Let's re-evaluate uplifting this fix once a reproducible crash test is ready. I hope it's soon.
Comment on attachment 8728422 [details] [diff] [review]
patch v3

Please see comments 40 and 43. I will approve the uplift once we have a reproducible crash test.
Attachment #8728422 - Flags: approval-mozilla-beta?
Attachment #8728422 - Flags: approval-mozilla-beta-
Attachment #8728422 - Flags: approval-mozilla-aurora?
Attachment #8728422 - Flags: approval-mozilla-aurora-
Attached patch testSplinter Review
Attachment #8734468 - Flags: review?(daniel)
(In reply to Ritu Kothari (:ritu) from comment #43)
> (In reply to Michal Novotny (:michal) from comment #42)
> > Btw, I still have a plan to try to create some test for the crash.
> 
> Thanks Michal. Let's re-evaluate uplifting this fix once a reproducible
> crash test is ready. I hope it's soon.

test:
https://treeherder.mozilla.org/#/jobs?repo=try&revision=9588bfb1e67a&selectedJob=18556028

test and fix backed out:
https://treeherder.mozilla.org/#/jobs?repo=try&revision=3870d2544482&selectedJob=18559465
Attachment #8728422 - Flags: approval-mozilla-beta?
Attachment #8728422 - Flags: approval-mozilla-beta-
Attachment #8728422 - Flags: approval-mozilla-aurora?
Attachment #8728422 - Flags: approval-mozilla-aurora-
(In reply to Michal Novotny (:michal) from comment #46)
> (In reply to Ritu Kothari (:ritu) from comment #43)
> > (In reply to Michal Novotny (:michal) from comment #42)
> > > Btw, I still have a plan to try to create some test for the crash.
> > 
> > Thanks Michal. Let's re-evaluate uplifting this fix once a reproducible
> > crash test is ready. I hope it's soon.
> 
> test:
> https://treeherder.mozilla.org/#/
> jobs?repo=try&revision=9588bfb1e67a&selectedJob=18556028
> 
> test and fix backed out:
> https://treeherder.mozilla.org/#/
> jobs?repo=try&revision=3870d2544482&selectedJob=18559465

Awesome work! This is amazing. I bow to your crash test and approve the fix. :)
Comment on attachment 8728422 [details] [diff] [review]
patch v3

Crash fix that was verified using an automated crash test. Aurora47+, Beta46+
Attachment #8728422 - Flags: approval-mozilla-beta?
Attachment #8728422 - Flags: approval-mozilla-beta+
Attachment #8728422 - Flags: approval-mozilla-aurora?
Attachment #8728422 - Flags: approval-mozilla-aurora+
Attachment #8734468 - Flags: review?(daniel) → review+
You need to log in before you can comment on or make changes to this bug.