Last Comment Bug 716841 - EventSource::GetInterface goes into infinite loop if EventSource hits an HTTP redirect
: EventSource::GetInterface goes into infinite loop if EventSource hits an HTTP...
Status: RESOLVED FIXED
[qa-]
: crash, regression
Product: Core
Classification: Components
Component: Networking: WebSockets (show other bugs)
: 11 Branch
: All Windows 7
: -- critical (vote)
: mozilla13
Assigned To: Jason Duell [:jduell] (needinfo me)
:
: Patrick McManus [:mcmanus]
Mentors:
Depends on:
Blocks: 664179
  Show dependency treegraph
 
Reported: 2012-01-10 02:06 PST by Scoobidiver (away)
Modified: 2012-03-05 17:53 PST (History)
13 users (show)
See Also:
Crash Signature:
QA Whiteboard:
Iteration: ---
Points: ---
Has Regression Range: ---
Has STR: ---
+
fixed
+
fixed
unaffected


Attachments
Fix and mochitest (5.87 KB, patch)
2012-02-23 16:57 PST, Jason Duell [:jduell] (needinfo me)
bzbarsky: review+
lukasblakk+bugs: approval‑mozilla‑aurora+
lukasblakk+bugs: approval‑mozilla‑beta+
Details | Diff | Splinter Review

Description Scoobidiver (away) 2012-01-10 02:06:05 PST
It's #58 top browser crasher in 11.0a2 and #107 in 12.0a1.

It first appeared in 11.0a1/20111209. The regression range is:
http://hg.mozilla.org/mozilla-central/pushloghtml?fromchange=6785d3003414&tochange=9e7239c0f557

Signature 	nsEventSource::GetInterface(nsID const&, void**) More Reports Search
UUID	9ee107dd-b569-4147-8fc3-a68c42120108
Date Processed	2012-01-08 17:11:51.776646
Uptime	23100
Last Crash	17.5 hours before submission
Install Age	6.4 hours since version was first installed.
Install Time	2012-01-08 19:47:45
Product	Firefox
Version	12.0a1
Build ID	20120108031024
Release Channel	nightly
OS	Windows NT
OS Version	5.1.2600 Service Pack 3
Build Architecture	x86
Build Architecture Info	GenuineIntel family 6 model 15 stepping 11
Crash Reason	EXCEPTION_STACK_OVERFLOW
Crash Address	0x16e539b
App Notes 	
AdapterVendorID: 0x10de, AdapterDeviceID: 0x0606, AdapterSubsysID: 23351682, AdapterDriverVersion: 6.14.12.8558
D3D10 Layers? D3D10 Layers-
D3D9 Layers? D3D9 Layers+
Processor Notes 	This dump is too long and has triggered the automatic truncation routine
EMCheckCompatibility	False

Crashing Thread
Frame 	Module 	Signature [Expand] 	Source
0 	xul.dll 	nsEventSource::GetInterface 	content/base/src/nsEventSource.cpp:767
1 	xul.dll 	nsCORSListenerProxy::GetInterface 	content/base/src/nsCrossSiteListenerProxy.cpp:651
2 	xul.dll 	nsEventSource::GetInterface 	content/base/src/nsEventSource.cpp:782
3 	xul.dll 	nsCORSListenerProxy::GetInterface 	content/base/src/nsCrossSiteListenerProxy.cpp:651
4 	xul.dll 	nsEventSource::GetInterface 	content/base/src/nsEventSource.cpp:782
5 	xul.dll 	nsCORSListenerProxy::GetInterface 	content/base/src/nsCrossSiteListenerProxy.cpp:651
...

More reports at:
https://crash-stats.mozilla.com/report/list?signature=nsEventSource%3A%3AGetInterface%28nsID%20const%26%2C%20void**%29
Comment 1 Mats Palmgren (:mats) 2012-01-14 16:49:03 PST
In the regression window, bug 692067 seems the likely culprit.
Comment 3 Boris Zbarsky [:bz] (still a bit busy) 2012-01-14 19:20:27 PST
ccing kairo, because the correlations stuff in soccoro seems to be broken and it would be really handy right now...

I'm not sure why the content policy bit would be relevant here, but the websocket code's notification callbacks handling seems completely busted in the face of redirects.  Maybe the only reason content policy matters is because it does a getInterface for an IID that the CORS listener proxy and nsEventSource don't provide?  Or maybe it triggers a redirect where we didn't have one before?  Looking at a few crashes manually (see broken correlations bit), it looks like there's no particular extension the users have; in fact some have no extensions at all.

In any case, the reason why I think redirect handling is busted is this.  nsEventSource::OnRedirectVerifyCallback calls SetupHttpChannel(), which does this:

871   nsCOMPtr<nsIInterfaceRequestor> notificationCallbacks;
872   mHttpChannel->GetNotificationCallbacks(getter_AddRefs(notificationCallbacks));
873   if (notificationCallbacks != this) {
874     mNotificationCallbacks = notificationCallbacks;
875     mHttpChannel->SetNotificationCallbacks(this);
876   }

But mHttpChannel at that point is the _post_redirect_ channel.  And HttpBaseChannel::SetupReplacementChannel does

1571   newChannel->SetNotificationCallbacks(mCallbacks);

before any redirect observers are notified.  So in the code above |notificationCallbacks| are whatever the callbacks were on the pre-redirect channel.

And what are those?  Well, websocket code sets up the channel in nsEventSource::InitChannelAndRequestEventSource as follows:

917   rv = SetupHttpChannel();
918   NS_ENSURE_SUCCESS(rv, rv);
919 
920   nsCOMPtr<nsIStreamListener> listener =
921     new nsCORSListenerProxy(this, mPrincipal, mHttpChannel,
922                             mWithCredentials, &rv);
923   NS_ENSURE_SUCCESS(rv, rv);

After line 917, the notificationCallbacks on the channel are the nsEventSource.  After the nsCORSListenerProxy ctor runs, the notificationCallbacks are the nsCORSListenerProxy, chaining back to the nsEventSource via its mOuterNotificationCallbacks member.

So when we hit the redirect, the notificationCallbacks are the nsCORSListenerProxy, which is != this, so we set |this| as the notificationCallbacks again, chaining back to the nsCORSListenerProxy.  And now we have a loop in the notificationCallbacks chain.

We should try moving the code that sets the notification callbacks out of SetupHTTPChannel and into nsEventSource::InitChannelAndRequestEventSource and see if that fixes things.  But it'd also be nice to understand why/how this redirect stuff can get triggered and why content policy affects it.
Comment 4 Robert Kaiser 2012-01-15 08:57:30 PST
(In reply to Boris Zbarsky (:bz) from comment #3)
> ccing kairo, because the correlations stuff in soccoro seems to be broken
> and it would be really handy right now...

I don't know about it being broken, but it doesn't run for all versions due to its current hacky implementation (improvements are planned), and it's acting in fashions that are not always trivial to understand without being told about them. What didn't look right for you? What version(s) did you try them for?
Comment 5 Boris Zbarsky [:bz] (still a bit busy) 2012-01-15 13:28:56 PST
Hmm.  Maybe I just misunderstood the UI?  When I look _under_ the "add-ons" bit it says "Loaded Correlation Data, but none available for this signature nsEventSource::GetInterface(nsID const&, void**)".  Under Add-ons by versions, clicking "Load" just complains about undefined indices.  See https://crash-stats.mozilla.com/report/index/c13e9360-4325-454b-ba40-be22c2120113 for an example.
Comment 6 Scoobidiver (away) 2012-01-15 14:15:39 PST
(In reply to Boris Zbarsky (:bz) from comment #5)
> Hmm.  Maybe I just misunderstood the UI?  When I look _under_ the "add-ons"
> bit it says "Loaded Correlation Data, but none available for this signature
> nsEventSource::GetInterface(nsID const&, void**)".
Correlations for a version are only available when there is enough data, i.e. for 9.0.1 and 10.0b4 (about two weeks before the release). There are no correlations for Aurora, Nightly or Fennec 9.0.
See: https://crash-analysis.mozilla.com/crash_analysis/20120115/
Comment 7 Alex Keybl [:akeybl] 2012-02-06 16:53:30 PST
(In reply to Scoobidiver from comment #6)
> Correlations for a version are only available when there is enough data,
> i.e. for 9.0.1 and 10.0b4 (about two weeks before the release). There are no
> correlations for Aurora, Nightly or Fennec 9.0.
> See: https://crash-analysis.mozilla.com/crash_analysis/20120115/

Now that FF11 is on Beta, are the correlation reports of more use?
Comment 8 Scoobidiver (away) 2012-02-07 00:09:22 PST
(In reply to Alex Keybl [:akeybl] from comment #7)
> Now that FF11 is on Beta, are the correlation reports of more use?
The FF10 correlations are still broken (see bug 724799).
The FF11 correlations are not statistically representative (20 crashes a days) and give:
  nsEventSource::GetInterface(nsID const&, void**)|EXCEPTION_STACK_OVERFLOW (20 crashes)
     75% (15/20) vs.  45% (7284/16009) firefox-hotfix@mozilla.org
     20% (4/20) vs.   2% (267/16009) {a0d7ccb3-214d-498b-b4aa-0e8fda9a7bf7} (WOT, https://addons.mozilla.org/addon/3456)
     15% (3/20) vs.   0% (5/16009) fastdialfx6@rouing3.addons.mozilla.org
     15% (3/20) vs.   0% (14/16009) xpirftoolbar@roboform.com
     15% (3/20) vs.   1% (196/16009) firebug@software.joehewitt.com (Firebug, https://addons.mozilla.org/addon/1843)
     15% (3/20) vs.   1% (196/16009) ffxtlbr@Facemoods.com
     15% (3/20) vs.   2% (259/16009) {1018e4d6-728f-4b20-ad56-37578a4de76b} (Flagfox, https://addons.mozilla.org/addon/5791)
     15% (3/20) vs.   2% (359/16009) {23fcfd51-4958-4f00-80a3-ae97e717ed8b}
     15% (3/20) vs.   3% (407/16009) {D4DD63FA-01E4-46a7-B6B1-EDAB7D6AD389} (Download Statusbar, https://addons.mozilla.org/addon/26)
     15% (3/20) vs.   5% (731/16009) wrc@avast.com
     10% (2/20) vs.   0% (8/16009) doudehou@gmail.com (StatusbarEx, https://addons.mozilla.org/addon/3271)
     10% (2/20) vs.   0% (21/16009) toolbar@alexa.com (Alexa Sparky, https://addons.mozilla.org/addon/5362)
     10% (2/20) vs.   0% (26/16009) info@djzig.com (LavaFox V1, https://addons.mozilla.org/addon/11861)
     10% (2/20) vs.   0% (31/16009) quickstores@quickstores.de
     10% (2/20) vs.   0% (60/16009) cacaoweb@cacaoweb.org
     10% (2/20) vs.   1% (106/16009) vshare@toolbar
     10% (2/20) vs.   1% (131/16009) 64ffxtbr@TelevisionFanatic.com
     20% (4/20) vs.  11% (1742/16009) {d10d0bf8-f5b5-c8b4-a8b2-2b9879e08c5d} (Adblock Plus, https://addons.mozilla.org/addon/1865)
     10% (2/20) vs.   2% (249/16009) {EB9394A3-4AD6-4918-9537-31A1FD8E8EDF}
     10% (2/20) vs.   2% (265/16009) {99079a25-328f-4bd4-be04-00955acaa0a7}
     10% (2/20) vs.   3% (532/16009) {e4a8a97b-f2ed-450b-b12d-ee082ba24781} (Greasemonkey, https://addons.mozilla.org/addon/748)

  nsEventSource::GetInterface(nsID const&, void**)|EXCEPTION_STACK_OVERFLOW (20 crashes)
     35% (7/20) vs.  12% (1895/16009) GrooveShellExtensions.dll (MS Office)
          5% (1/20) vs.   5% (770/16009) 12.0.4518.1014
         10% (2/20) vs.   6% (921/16009) 12.0.6421.1000
         20% (4/20) vs.   1% (187/16009) 12.0.6500.5000
     25% (5/20) vs.   8% (1302/16009) Aavm4h.dll (Avast AV)
         15% (3/20) vs.   1% (133/16009) 6.0.1289.0
         10% (2/20) vs.   7% (1163/16009) 6.0.1367.0
     15% (3/20) vs.   2% (360/16009) RocketDock.dll (RocketDock)
     15% (3/20) vs.   3% (547/16009) DropboxExt.14.dll (1.0.0.14) (DropBox)
     10% (2/20) vs.   0% (50/16009) 64radio.dll (1.0.0.9) (RadioControl)
     10% (2/20) vs.   4% (687/16009) mgAdaptersProxy.dll (3.6.0.2) (SweetIM)
Comment 9 Scoobidiver (away) 2012-02-13 08:00:22 PST
With combined signatures, it's #48 top browser crasher in 11.0b1.

Here are fresh 11 correlations:
  nsEventSource::GetInterface(nsID const&, void**)|EXCEPTION_STACK_OVERFLOW (25 crashes)
     44% (11/25) vs.   6% (1253/22290) datamngr.dll (SearchQu Toolbar)
         40% (10/25) vs.   5% (1060/22290) 1.0.0.1
          4% (1/25) vs.   0% (21/22290) 3.0.0.46593
     32% (8/25) vs.   9% (2080/22290) GrooveUtil.DLL (MS Office)
          8% (2/25) vs.   4% (945/22290) 12.0.4518.1014
          4% (1/25) vs.   0% (44/22290) 12.0.6550.5004
         16% (4/25) vs.   3% (719/22290) 12.0.6562.5000
          4% (1/25) vs.   1% (269/22290) 12.0.6606.1000
     20% (5/25) vs.   0% (8/22290) PCIHOOKS.DLL (NetSupport Manager)
         20% (5/25) vs.   0% (5/22290) 10.60.0.391
     16% (4/25) vs.   0% (28/22290) APSHook.dll (Cognizance Identity Manager)
         16% (4/25) vs.   0% (12/22290) 2.0.0.15
    16% (4/25) vs.   1% (121/22290) DpOSet.dll (HP ProtectTools Security Manager)
         16% (4/25) vs.   0% (17/22290) 5.1.0.334
     16% (4/25) vs.   5% (1128/22290) GROOVEEX.DLL (MS Office 2010)
         16% (4/25) vs.   3% (733/22290) 14.0.6106.5000
     12% (3/25) vs.   1% (321/22290) GoogleDesktopCommon.dll (Google Desktop)
         12% (3/25) vs.   1% (290/22290) 5.9.1005.12335
Comment 10 Alex Keybl [:akeybl] 2012-02-13 14:27:30 PST
(In reply to Boris Zbarsky (:bz) from comment #5)
> Hmm.  Maybe I just misunderstood the UI?  When I look _under_ the "add-ons"
> bit it says "Loaded Correlation Data, but none available for this signature
> nsEventSource::GetInterface(nsID const&, void**)".  Under Add-ons by
> versions, clicking "Load" just complains about undefined indices.  See
> https://crash-stats.mozilla.com/report/index/c13e9360-4325-454b-ba40-
> be22c2120113 for an example.

Hi Boris, were you looking for the correlation reports to get a better idea of the types of 3rd party software that may be triggering this regression? Or would you like QA to help attempt to reproduce? If you'd like QA's help, any tips on what they should focus on when trying to reproduce (other than the above add-ons) would be greatly appreciated. Thanks!
Comment 11 Boris Zbarsky [:bz] (still a bit busy) 2012-02-13 15:09:05 PST
There weren't any correlation reports that I could find.

But in any case, comment 3 explains why this is happening and what we need to do to fix it....  The only thing that's unclear is why these add-ons trigger it.

Has QA tried using the relevant addons while loading websocket testcases?
Comment 12 Alex Keybl [:akeybl] 2012-02-15 16:52:29 PST
(In reply to Boris Zbarsky (:bz) from comment #11)
> Has QA tried using the relevant addons while loading websocket testcases?

Adding qawanted, but I'm sure they'll ask to be pointed at applicable testcases.

Let's install WOT, fastdialfx6, and Roboform when performing the testing.
Comment 13 Sheila Mooney 2012-02-21 21:57:34 PST
I will try and get someone in QA to help - Marcia is away for a few days.
Comment 14 Jason Duell [:jduell] (needinfo me) 2012-02-23 01:51:58 PST
bz is probably right about this being caused by a loop where nsEventSource and nsCORSListenerProxy wind up being each other's callbacks.  But I don't think this has anything to do with websockets (other than that it seems likely that the websockets Content Policy patch for bug 692067 somehow causes this to happen: I don't see any other likely candidates in the regression window).

Specifically, the scenario in comment 3 doesn't happen for websockets as far as i can tell (I've been trying to repro in the debugger, and don't see it).

>  Well, websocket code sets up the channel in 
> nsEventSource::InitChannelAndRequestEventSource as follows:

No it doesn't.  nsEventSource isn't involved in websocket setup at all.  I've been running ws tests through the debugger and we don't ever call any nsEventSource functions during them.  Actually, we don't seem to call nsEventSource functions for anything in the browser--I surfed nytimes.com, facebook, etc w/o ever hitting them.  I assume that one needs to hit a page with JS that explicitly instantiates an "EventSource" object to actually invoke any of this code?
Comment 15 Jason Duell [:jduell] (needinfo me) 2012-02-23 02:05:24 PST
Are we sure that this line in the patch for bug 692067 was right?

   +  csp._MAPPINGS[cp.TYPE_WEBSOCKET]         = cspr_sd.XHR_SRC;

The only other theory I've got from looking at the patch is that the NS_CheckContentLoadPolicy call in WebSocketChannel:Init somehow messes up things for EventSource objects--seems like a stretch, but I don't know this code much.  Tomorrow I'll try running the existing mochitest for Eventsource (test_bug338583.html), perhaps adding a websocket to it.  Unless someone has a better idea.
Comment 16 Jason Duell [:jduell] (needinfo me) 2012-02-23 02:06:54 PST
cc-ing the patch author/reviewers of bug 692067, since this seems to be the likely culprit.
Comment 17 Jason Duell [:jduell] (needinfo me) 2012-02-23 02:09:09 PST
khuey/smaug,

I guess you were already cc'd on this.  Could you look over comment 3, 14/15 and see if there's anything obvious in your patch that might be causing this?
Comment 18 Patrick McManus [:mcmanus] 2012-02-23 05:34:13 PST
hey jason, is the content policy triggering any kind of redirect? It's totally posislbe that is busted in websockets because normal redirects are prohibited by websockets so that code isn't run. (but for example, I think STS could run it.)
Comment 19 Boris Zbarsky [:bz] (still a bit busy) 2012-02-23 08:11:20 PST
Some content policies (HTTPS everywhere, noscript) definitely trigger redirects by hand.
Comment 20 juan becerra [:juanb] 2012-02-23 10:06:29 PST
I have Fx11b4(build1) on Windows 7 running with WOT, NoScript, Adblock Plus, HTTPS Everywhere and Fast Dial 4.0 and I have been trying to crash the application by going to different sites like Google properties, loging in, checking/sending mail, and such. I've also tried testing with websockets test suites that I found on google, the kind that ask you to load a page and it runs a script with a pass/fail result at the end. I've also tried other demos.

I haven't been able to crash yet.
Comment 21 Jason Duell [:jduell] (needinfo me) 2012-02-23 11:27:21 PST
juan,

It looks like this issue will require that a web page use EventSource (which is fairly rare AFAICT: I can't point you at any I know of).  And possibly also websockets?  Not clear yet.  I'm going to try to write a mochitest that exposes the crash.
Comment 22 Jason Duell [:jduell] (needinfo me) 2012-02-23 16:57:22 PST
Created attachment 600241 [details] [diff] [review]
Fix and mochitest

So this is indeed an EventSource-only bug, not websockets.

The fix is as bz specified in comment 3: just move setting of callbacks.

Test is also attached.

It appears that bug 692067 actually has nothing to do with this.  I tried backing that patch out and running my test, and we get the same infinite loop.
AFACIT redirects for EventSource were just broken, and we started encountering them in the wild just as we landed 692067.  That's my best guess, at least.

This is a simple fix, and I'd recommend we land it on aurora/beta (not setting approval flags 'cause I don't have review yet: bz feel free to do the honors.)
Comment 23 Patrick McManus [:mcmanus] 2012-02-23 17:23:54 PST
awesome jason.
Comment 24 Boris Zbarsky [:bz] (still a bit busy) 2012-02-23 17:32:47 PST
Comment on attachment 600241 [details] [diff] [review]
Fix and mochitest

r=me
Comment 25 Jason Duell [:jduell] (needinfo me) 2012-02-24 11:40:51 PST
https://hg.mozilla.org/integration/mozilla-inbound/rev/9c1042e30503
Comment 26 Jason Duell [:jduell] (needinfo me) 2012-02-24 11:44:54 PST
Comment on attachment 600241 [details] [diff] [review]
Fix and mochitest

I'd suggest we take this on esr too--how do I nominate for that?

Regression caused by (bug #): Appears to have been in the source from the beginning, but is only now being hit by websites (as we see the EventSource API being picked up more widely on the web)

User impact if declined: Any site that uses an EventSource with a URI that redirects will cause browser to go into infinite loop then crash.

Testing completed (on m-c, etc.):  mochitest

Risk to taking this patch (and alternatives if risky):  very low risk.

String changes made by this patch: no
Comment 27 Scoobidiver (away) 2012-02-24 13:35:57 PST
(In reply to Jason Duell (:jduell) from comment #26)
> I'd suggest we take this on esr too--how do I nominate for that?
> Regression caused by (bug #): Appears to have been in the source from the 
> beginning, but is only now being hit by websites (as we see the EventSource API
> being picked up more widely on the web)
I see only one crash in 10.0 and one in 10.0.2 while there are 178 crashes in 11.0b3 for combined signatures, so there's a patch in the regression range that make it more hittable.
Comment 28 Marco Bonardo [::mak] 2012-02-25 02:20:53 PST
https://hg.mozilla.org/mozilla-central/rev/9c1042e30503
Comment 29 Jason Duell [:jduell] (needinfo me) 2012-02-27 11:21:28 PST
This changeset is almost the culprit (added the CORS listener that is needed for the infinite loop)

changeset:   81500:ed783dfd8179
user:        Wellington Fernando de Macedo <wfernandom2004@gmail.com>
date:        Mon Dec 05 21:02:42 2011 -0200
summary:     Bug 664179 - Allow Cross-Origin URLs in EventSource (Server-Sent Events). r=sicking

It's in aurora/beta, but not in esr.
Comment 30 Jason Duell [:jduell] (needinfo me) 2012-02-27 11:21:59 PST
s/almost/almost certainly/ :)
Comment 31 Lukas Blakk [:lsblakk] use ?needinfo 2012-02-27 15:56:02 PST
Comment on attachment 600241 [details] [diff] [review]
Fix and mochitest

[Triage Comment]
Accepted based on low risk and no crash reports since it landed
Comment 33 Anthony Hughes (:ashughes) [GFX][QA][Mentor] 2012-03-01 13:41:58 PST
Removing qawanted since this is marked fixed. Will be verified in Beta6.
Comment 34 Anthony Hughes (:ashughes) [GFX][QA][Mentor] 2012-03-05 16:05:22 PST
Given that QA was never able to reproduce this, what can we do to help verify the fix?
Comment 35 Jason Duell [:jduell] (needinfo me) 2012-03-05 17:50:41 PST
I added a mochitest as part of my fix that clearly repro's the issue.  I'd consider it verified.
Comment 36 Anthony Hughes (:ashughes) [GFX][QA][Mentor] 2012-03-05 17:53:25 PST
Marking qa- given comment 35.

Note You need to log in before you can comment on or make changes to this bug.