Closed Bug 716841 Opened 13 years ago Closed 12 years ago

EventSource::GetInterface goes into infinite loop if EventSource hits an HTTP redirect

Categories

(Core :: Networking: WebSockets, defect)

11 Branch
All
Windows 7
defect
Not set
critical

Tracking

()

RESOLVED FIXED
mozilla13
Tracking Status
firefox11 + fixed
firefox12 + fixed
firefox-esr10 --- unaffected

People

(Reporter: scoobidiver, Assigned: jduell.mcbugs)

References

Details

(Keywords: crash, regression, Whiteboard: [qa-])

Crash Data

Attachments

(1 file)

It's #58 top browser crasher in 11.0a2 and #107 in 12.0a1.

It first appeared in 11.0a1/20111209. The regression range is:
http://hg.mozilla.org/mozilla-central/pushloghtml?fromchange=6785d3003414&tochange=9e7239c0f557

Signature 	nsEventSource::GetInterface(nsID const&, void**) More Reports Search
UUID	9ee107dd-b569-4147-8fc3-a68c42120108
Date Processed	2012-01-08 17:11:51.776646
Uptime	23100
Last Crash	17.5 hours before submission
Install Age	6.4 hours since version was first installed.
Install Time	2012-01-08 19:47:45
Product	Firefox
Version	12.0a1
Build ID	20120108031024
Release Channel	nightly
OS	Windows NT
OS Version	5.1.2600 Service Pack 3
Build Architecture	x86
Build Architecture Info	GenuineIntel family 6 model 15 stepping 11
Crash Reason	EXCEPTION_STACK_OVERFLOW
Crash Address	0x16e539b
App Notes 	
AdapterVendorID: 0x10de, AdapterDeviceID: 0x0606, AdapterSubsysID: 23351682, AdapterDriverVersion: 6.14.12.8558
D3D10 Layers? D3D10 Layers-
D3D9 Layers? D3D9 Layers+
Processor Notes 	This dump is too long and has triggered the automatic truncation routine
EMCheckCompatibility	False

Crashing Thread
Frame 	Module 	Signature [Expand] 	Source
0 	xul.dll 	nsEventSource::GetInterface 	content/base/src/nsEventSource.cpp:767
1 	xul.dll 	nsCORSListenerProxy::GetInterface 	content/base/src/nsCrossSiteListenerProxy.cpp:651
2 	xul.dll 	nsEventSource::GetInterface 	content/base/src/nsEventSource.cpp:782
3 	xul.dll 	nsCORSListenerProxy::GetInterface 	content/base/src/nsCrossSiteListenerProxy.cpp:651
4 	xul.dll 	nsEventSource::GetInterface 	content/base/src/nsEventSource.cpp:782
5 	xul.dll 	nsCORSListenerProxy::GetInterface 	content/base/src/nsCrossSiteListenerProxy.cpp:651
...

More reports at:
https://crash-stats.mozilla.com/report/list?signature=nsEventSource%3A%3AGetInterface%28nsID%20const%26%2C%20void**%29
Summary: Crash @ nsEventSource::GetInterface(nsID const&, void**) → Crash @ nsEventSource::GetInterface
In the regression window, bug 692067 seems the likely culprit.
Blocks: 692067
Summary: Crash @ nsEventSource::GetInterface → Re-entrant calls to nsCORSListenerProxy::GetInterface leads to stack overflow and Crash @ nsEventSource::GetInterface
Component: General → Networking: WebSockets
QA Contact: general → networking.websockets
The crash also shows up as [@ nsCORSListenerProxy::GetInterface(nsID const&, void**) ]
so it should rank slightly higher on the top crash list.
https://crash-stats.mozilla.com/report/list?product=Firefox&query_search=signature&query_type=contains&query=nsCORSListenerProxy%3A%3AGetInterface&reason_type=contains&date=01%2F14%2F2012%2016%3A52%3A53&range_value=4&range_unit=weeks&hang_type=any&process_type=any&do_query=1&signature=nsCORSListenerProxy%3A%3AGetInterface%28nsID%20const%26%2C%20void**%29
Crash Signature: [@ nsEventSource::GetInterface(nsID const&, void**)] → [@ nsEventSource::GetInterface(nsID const&, void**)] [@ nsCORSListenerProxy::GetInterface(nsID const&, void**) ]
ccing kairo, because the correlations stuff in soccoro seems to be broken and it would be really handy right now...

I'm not sure why the content policy bit would be relevant here, but the websocket code's notification callbacks handling seems completely busted in the face of redirects.  Maybe the only reason content policy matters is because it does a getInterface for an IID that the CORS listener proxy and nsEventSource don't provide?  Or maybe it triggers a redirect where we didn't have one before?  Looking at a few crashes manually (see broken correlations bit), it looks like there's no particular extension the users have; in fact some have no extensions at all.

In any case, the reason why I think redirect handling is busted is this.  nsEventSource::OnRedirectVerifyCallback calls SetupHttpChannel(), which does this:

871   nsCOMPtr<nsIInterfaceRequestor> notificationCallbacks;
872   mHttpChannel->GetNotificationCallbacks(getter_AddRefs(notificationCallbacks));
873   if (notificationCallbacks != this) {
874     mNotificationCallbacks = notificationCallbacks;
875     mHttpChannel->SetNotificationCallbacks(this);
876   }

But mHttpChannel at that point is the _post_redirect_ channel.  And HttpBaseChannel::SetupReplacementChannel does

1571   newChannel->SetNotificationCallbacks(mCallbacks);

before any redirect observers are notified.  So in the code above |notificationCallbacks| are whatever the callbacks were on the pre-redirect channel.

And what are those?  Well, websocket code sets up the channel in nsEventSource::InitChannelAndRequestEventSource as follows:

917   rv = SetupHttpChannel();
918   NS_ENSURE_SUCCESS(rv, rv);
919 
920   nsCOMPtr<nsIStreamListener> listener =
921     new nsCORSListenerProxy(this, mPrincipal, mHttpChannel,
922                             mWithCredentials, &rv);
923   NS_ENSURE_SUCCESS(rv, rv);

After line 917, the notificationCallbacks on the channel are the nsEventSource.  After the nsCORSListenerProxy ctor runs, the notificationCallbacks are the nsCORSListenerProxy, chaining back to the nsEventSource via its mOuterNotificationCallbacks member.

So when we hit the redirect, the notificationCallbacks are the nsCORSListenerProxy, which is != this, so we set |this| as the notificationCallbacks again, chaining back to the nsCORSListenerProxy.  And now we have a loop in the notificationCallbacks chain.

We should try moving the code that sets the notification callbacks out of SetupHTTPChannel and into nsEventSource::InitChannelAndRequestEventSource and see if that fixes things.  But it'd also be nice to understand why/how this redirect stuff can get triggered and why content policy affects it.
(In reply to Boris Zbarsky (:bz) from comment #3)
> ccing kairo, because the correlations stuff in soccoro seems to be broken
> and it would be really handy right now...

I don't know about it being broken, but it doesn't run for all versions due to its current hacky implementation (improvements are planned), and it's acting in fashions that are not always trivial to understand without being told about them. What didn't look right for you? What version(s) did you try them for?
Hmm.  Maybe I just misunderstood the UI?  When I look _under_ the "add-ons" bit it says "Loaded Correlation Data, but none available for this signature nsEventSource::GetInterface(nsID const&, void**)".  Under Add-ons by versions, clicking "Load" just complains about undefined indices.  See https://crash-stats.mozilla.com/report/index/c13e9360-4325-454b-ba40-be22c2120113 for an example.
(In reply to Boris Zbarsky (:bz) from comment #5)
> Hmm.  Maybe I just misunderstood the UI?  When I look _under_ the "add-ons"
> bit it says "Loaded Correlation Data, but none available for this signature
> nsEventSource::GetInterface(nsID const&, void**)".
Correlations for a version are only available when there is enough data, i.e. for 9.0.1 and 10.0b4 (about two weeks before the release). There are no correlations for Aurora, Nightly or Fennec 9.0.
See: https://crash-analysis.mozilla.com/crash_analysis/20120115/
Crash Signature: [@ nsEventSource::GetInterface(nsID const&, void**)] [@ nsCORSListenerProxy::GetInterface(nsID const&, void**) ] → [@ nsEventSource::GetInterface(nsID const&, void**)] [@ nsCORSListenerProxy::GetInterface(nsID const&, void**)]
(In reply to Scoobidiver from comment #6)
> Correlations for a version are only available when there is enough data,
> i.e. for 9.0.1 and 10.0b4 (about two weeks before the release). There are no
> correlations for Aurora, Nightly or Fennec 9.0.
> See: https://crash-analysis.mozilla.com/crash_analysis/20120115/

Now that FF11 is on Beta, are the correlation reports of more use?
Depends on: 724799
(In reply to Alex Keybl [:akeybl] from comment #7)
> Now that FF11 is on Beta, are the correlation reports of more use?
The FF10 correlations are still broken (see bug 724799).
The FF11 correlations are not statistically representative (20 crashes a days) and give:
  nsEventSource::GetInterface(nsID const&, void**)|EXCEPTION_STACK_OVERFLOW (20 crashes)
     75% (15/20) vs.  45% (7284/16009) firefox-hotfix@mozilla.org
     20% (4/20) vs.   2% (267/16009) {a0d7ccb3-214d-498b-b4aa-0e8fda9a7bf7} (WOT, https://addons.mozilla.org/addon/3456)
     15% (3/20) vs.   0% (5/16009) fastdialfx6@rouing3.addons.mozilla.org
     15% (3/20) vs.   0% (14/16009) xpirftoolbar@roboform.com
     15% (3/20) vs.   1% (196/16009) firebug@software.joehewitt.com (Firebug, https://addons.mozilla.org/addon/1843)
     15% (3/20) vs.   1% (196/16009) ffxtlbr@Facemoods.com
     15% (3/20) vs.   2% (259/16009) {1018e4d6-728f-4b20-ad56-37578a4de76b} (Flagfox, https://addons.mozilla.org/addon/5791)
     15% (3/20) vs.   2% (359/16009) {23fcfd51-4958-4f00-80a3-ae97e717ed8b}
     15% (3/20) vs.   3% (407/16009) {D4DD63FA-01E4-46a7-B6B1-EDAB7D6AD389} (Download Statusbar, https://addons.mozilla.org/addon/26)
     15% (3/20) vs.   5% (731/16009) wrc@avast.com
     10% (2/20) vs.   0% (8/16009) doudehou@gmail.com (StatusbarEx, https://addons.mozilla.org/addon/3271)
     10% (2/20) vs.   0% (21/16009) toolbar@alexa.com (Alexa Sparky, https://addons.mozilla.org/addon/5362)
     10% (2/20) vs.   0% (26/16009) info@djzig.com (LavaFox V1, https://addons.mozilla.org/addon/11861)
     10% (2/20) vs.   0% (31/16009) quickstores@quickstores.de
     10% (2/20) vs.   0% (60/16009) cacaoweb@cacaoweb.org
     10% (2/20) vs.   1% (106/16009) vshare@toolbar
     10% (2/20) vs.   1% (131/16009) 64ffxtbr@TelevisionFanatic.com
     20% (4/20) vs.  11% (1742/16009) {d10d0bf8-f5b5-c8b4-a8b2-2b9879e08c5d} (Adblock Plus, https://addons.mozilla.org/addon/1865)
     10% (2/20) vs.   2% (249/16009) {EB9394A3-4AD6-4918-9537-31A1FD8E8EDF}
     10% (2/20) vs.   2% (265/16009) {99079a25-328f-4bd4-be04-00955acaa0a7}
     10% (2/20) vs.   3% (532/16009) {e4a8a97b-f2ed-450b-b12d-ee082ba24781} (Greasemonkey, https://addons.mozilla.org/addon/748)

  nsEventSource::GetInterface(nsID const&, void**)|EXCEPTION_STACK_OVERFLOW (20 crashes)
     35% (7/20) vs.  12% (1895/16009) GrooveShellExtensions.dll (MS Office)
          5% (1/20) vs.   5% (770/16009) 12.0.4518.1014
         10% (2/20) vs.   6% (921/16009) 12.0.6421.1000
         20% (4/20) vs.   1% (187/16009) 12.0.6500.5000
     25% (5/20) vs.   8% (1302/16009) Aavm4h.dll (Avast AV)
         15% (3/20) vs.   1% (133/16009) 6.0.1289.0
         10% (2/20) vs.   7% (1163/16009) 6.0.1367.0
     15% (3/20) vs.   2% (360/16009) RocketDock.dll (RocketDock)
     15% (3/20) vs.   3% (547/16009) DropboxExt.14.dll (1.0.0.14) (DropBox)
     10% (2/20) vs.   0% (50/16009) 64radio.dll (1.0.0.9) (RadioControl)
     10% (2/20) vs.   4% (687/16009) mgAdaptersProxy.dll (3.6.0.2) (SweetIM)
With combined signatures, it's #48 top browser crasher in 11.0b1.

Here are fresh 11 correlations:
  nsEventSource::GetInterface(nsID const&, void**)|EXCEPTION_STACK_OVERFLOW (25 crashes)
     44% (11/25) vs.   6% (1253/22290) datamngr.dll (SearchQu Toolbar)
         40% (10/25) vs.   5% (1060/22290) 1.0.0.1
          4% (1/25) vs.   0% (21/22290) 3.0.0.46593
     32% (8/25) vs.   9% (2080/22290) GrooveUtil.DLL (MS Office)
          8% (2/25) vs.   4% (945/22290) 12.0.4518.1014
          4% (1/25) vs.   0% (44/22290) 12.0.6550.5004
         16% (4/25) vs.   3% (719/22290) 12.0.6562.5000
          4% (1/25) vs.   1% (269/22290) 12.0.6606.1000
     20% (5/25) vs.   0% (8/22290) PCIHOOKS.DLL (NetSupport Manager)
         20% (5/25) vs.   0% (5/22290) 10.60.0.391
     16% (4/25) vs.   0% (28/22290) APSHook.dll (Cognizance Identity Manager)
         16% (4/25) vs.   0% (12/22290) 2.0.0.15
    16% (4/25) vs.   1% (121/22290) DpOSet.dll (HP ProtectTools Security Manager)
         16% (4/25) vs.   0% (17/22290) 5.1.0.334
     16% (4/25) vs.   5% (1128/22290) GROOVEEX.DLL (MS Office 2010)
         16% (4/25) vs.   3% (733/22290) 14.0.6106.5000
     12% (3/25) vs.   1% (321/22290) GoogleDesktopCommon.dll (Google Desktop)
         12% (3/25) vs.   1% (290/22290) 5.9.1005.12335
(In reply to Boris Zbarsky (:bz) from comment #5)
> Hmm.  Maybe I just misunderstood the UI?  When I look _under_ the "add-ons"
> bit it says "Loaded Correlation Data, but none available for this signature
> nsEventSource::GetInterface(nsID const&, void**)".  Under Add-ons by
> versions, clicking "Load" just complains about undefined indices.  See
> https://crash-stats.mozilla.com/report/index/c13e9360-4325-454b-ba40-
> be22c2120113 for an example.

Hi Boris, were you looking for the correlation reports to get a better idea of the types of 3rd party software that may be triggering this regression? Or would you like QA to help attempt to reproduce? If you'd like QA's help, any tips on what they should focus on when trying to reproduce (other than the above add-ons) would be greatly appreciated. Thanks!
There weren't any correlation reports that I could find.

But in any case, comment 3 explains why this is happening and what we need to do to fix it....  The only thing that's unclear is why these add-ons trigger it.

Has QA tried using the relevant addons while loading websocket testcases?
Assignee: nobody → jduell.mcbugs
(In reply to Boris Zbarsky (:bz) from comment #11)
> Has QA tried using the relevant addons while loading websocket testcases?

Adding qawanted, but I'm sure they'll ask to be pointed at applicable testcases.

Let's install WOT, fastdialfx6, and Roboform when performing the testing.
Keywords: qawanted
I will try and get someone in QA to help - Marcia is away for a few days.
bz is probably right about this being caused by a loop where nsEventSource and nsCORSListenerProxy wind up being each other's callbacks.  But I don't think this has anything to do with websockets (other than that it seems likely that the websockets Content Policy patch for bug 692067 somehow causes this to happen: I don't see any other likely candidates in the regression window).

Specifically, the scenario in comment 3 doesn't happen for websockets as far as i can tell (I've been trying to repro in the debugger, and don't see it).

>  Well, websocket code sets up the channel in 
> nsEventSource::InitChannelAndRequestEventSource as follows:

No it doesn't.  nsEventSource isn't involved in websocket setup at all.  I've been running ws tests through the debugger and we don't ever call any nsEventSource functions during them.  Actually, we don't seem to call nsEventSource functions for anything in the browser--I surfed nytimes.com, facebook, etc w/o ever hitting them.  I assume that one needs to hit a page with JS that explicitly instantiates an "EventSource" object to actually invoke any of this code?
Are we sure that this line in the patch for bug 692067 was right?

   +  csp._MAPPINGS[cp.TYPE_WEBSOCKET]         = cspr_sd.XHR_SRC;

The only other theory I've got from looking at the patch is that the NS_CheckContentLoadPolicy call in WebSocketChannel:Init somehow messes up things for EventSource objects--seems like a stretch, but I don't know this code much.  Tomorrow I'll try running the existing mochitest for Eventsource (test_bug338583.html), perhaps adding a websocket to it.  Unless someone has a better idea.
cc-ing the patch author/reviewers of bug 692067, since this seems to be the likely culprit.
khuey/smaug,

I guess you were already cc'd on this.  Could you look over comment 3, 14/15 and see if there's anything obvious in your patch that might be causing this?
hey jason, is the content policy triggering any kind of redirect? It's totally posislbe that is busted in websockets because normal redirects are prohibited by websockets so that code isn't run. (but for example, I think STS could run it.)
Some content policies (HTTPS everywhere, noscript) definitely trigger redirects by hand.
I have Fx11b4(build1) on Windows 7 running with WOT, NoScript, Adblock Plus, HTTPS Everywhere and Fast Dial 4.0 and I have been trying to crash the application by going to different sites like Google properties, loging in, checking/sending mail, and such. I've also tried testing with websockets test suites that I found on google, the kind that ask you to load a page and it runs a script with a pass/fail result at the end. I've also tried other demos.

I haven't been able to crash yet.
juan,

It looks like this issue will require that a web page use EventSource (which is fairly rare AFAICT: I can't point you at any I know of).  And possibly also websockets?  Not clear yet.  I'm going to try to write a mochitest that exposes the crash.
So this is indeed an EventSource-only bug, not websockets.

The fix is as bz specified in comment 3: just move setting of callbacks.

Test is also attached.

It appears that bug 692067 actually has nothing to do with this.  I tried backing that patch out and running my test, and we get the same infinite loop.
AFACIT redirects for EventSource were just broken, and we started encountering them in the wild just as we landed 692067.  That's my best guess, at least.

This is a simple fix, and I'd recommend we land it on aurora/beta (not setting approval flags 'cause I don't have review yet: bz feel free to do the honors.)
Attachment #600241 - Flags: review?(bzbarsky)
awesome jason.
Comment on attachment 600241 [details] [diff] [review]
Fix and mochitest

r=me
Attachment #600241 - Flags: review?(bzbarsky) → review+
Status: NEW → ASSIGNED
Component: Networking: WebSockets → Event Handling
QA Contact: networking.websockets → events
https://hg.mozilla.org/integration/mozilla-inbound/rev/9c1042e30503
Assignee: jduell.mcbugs → nobody
Status: ASSIGNED → NEW
Component: Event Handling → Networking: WebSockets
QA Contact: events → networking.websockets
Summary: Re-entrant calls to nsCORSListenerProxy::GetInterface leads to stack overflow and Crash @ nsEventSource::GetInterface → EventSource::GetInterface goes into infinite loop if EventSource hits an HTTP redirect
Comment on attachment 600241 [details] [diff] [review]
Fix and mochitest

I'd suggest we take this on esr too--how do I nominate for that?

Regression caused by (bug #): Appears to have been in the source from the beginning, but is only now being hit by websites (as we see the EventSource API being picked up more widely on the web)

User impact if declined: Any site that uses an EventSource with a URI that redirects will cause browser to go into infinite loop then crash.

Testing completed (on m-c, etc.):  mochitest

Risk to taking this patch (and alternatives if risky):  very low risk.

String changes made by this patch: no
Attachment #600241 - Flags: approval-mozilla-beta?
Attachment #600241 - Flags: approval-mozilla-aurora?
(In reply to Jason Duell (:jduell) from comment #26)
> I'd suggest we take this on esr too--how do I nominate for that?
> Regression caused by (bug #): Appears to have been in the source from the 
> beginning, but is only now being hit by websites (as we see the EventSource API
> being picked up more widely on the web)
I see only one crash in 10.0 and one in 10.0.2 while there are 178 crashes in 11.0b3 for combined signatures, so there's a patch in the regression range that make it more hittable.
Assignee: nobody → jduell.mcbugs
Status: NEW → ASSIGNED
Target Milestone: --- → mozilla13
https://hg.mozilla.org/mozilla-central/rev/9c1042e30503
Status: ASSIGNED → RESOLVED
Closed: 12 years ago
Resolution: --- → FIXED
This changeset is almost the culprit (added the CORS listener that is needed for the infinite loop)

changeset:   81500:ed783dfd8179
user:        Wellington Fernando de Macedo <wfernandom2004@gmail.com>
date:        Mon Dec 05 21:02:42 2011 -0200
summary:     Bug 664179 - Allow Cross-Origin URLs in EventSource (Server-Sent Events). r=sicking

It's in aurora/beta, but not in esr.
Depends on: 664179
s/almost/almost certainly/ :)
Blocks: 664179
No longer depends on: 664179
No longer blocks: 692067
No longer depends on: 724799
Comment on attachment 600241 [details] [diff] [review]
Fix and mochitest

[Triage Comment]
Accepted based on low risk and no crash reports since it landed
Attachment #600241 - Flags: approval-mozilla-beta?
Attachment #600241 - Flags: approval-mozilla-beta+
Attachment #600241 - Flags: approval-mozilla-aurora?
Attachment #600241 - Flags: approval-mozilla-aurora+
Removing qawanted since this is marked fixed. Will be verified in Beta6.
Keywords: qawanted
Given that QA was never able to reproduce this, what can we do to help verify the fix?
Whiteboard: [qa?]
I added a mochitest as part of my fix that clearly repro's the issue.  I'd consider it verified.
Marking qa- given comment 35.
Whiteboard: [qa?] → [qa-]
You need to log in before you can comment on or make changes to this bug.

Attachment

General

Creator:
Created:
Updated:
Size: