Closed Bug 856748 Opened 11 years ago Closed 8 years ago

Limit rogue sites/addons from flooding XHR connections

Categories

(Core :: Networking, defect)

21 Branch
defect
Not set
normal

RESOLVED WONTFIX

People

(Reporter: merike, Unassigned)

Details

This is not the first time this has happened to me on this machine with my current profile, so I'm filing this in the hope of finding out where the problem lies.

Sometimes I middle-click a link, or press Ctrl+T to open a new tab and enter a URL, and suddenly Firefox won't load pages any more. Today I'm left with 6 such tabs, 4 of which are spinning backwards (connecting) and two forwards (loading). I suspect a correlation with smartcard usage, i.e. it seems to hang when I've used a smartcard for a while, or after removing the smartcard and continuing to use Firefox. The Firefox UI stays responsive: I'm able to open and close tabs and so on, but no URL will load, even though other programs can connect and load pages just fine. Restarting Firefox helps.

According to gdb's "bt", the current thread on Aurora 20130328042013 looks like this:
#0  0xb7745424 in __kernel_vsyscall ()
#1  0xb750f5f0 in poll () from /lib/i386-linux-gnu/libc.so.6
#2  0xb4ad1a7b in g_poll () from /lib/i386-linux-gnu/libglib-2.0.so.0
#3  0xb649e068 in PollWrapper (ufds=0xa3831600, nfsd=7, timeout_=-1)
    at /builds/slave/m-aurora-lx-ntly-0000000000000/build/widget/gtk2/nsAppShell.cpp:35
#4  0xb4ac40ae in ?? () from /lib/i386-linux-gnu/libglib-2.0.so.0
#5  0xb4ac4201 in g_main_context_iteration () from /lib/i386-linux-gnu/libglib-2.0.so.0
#6  0xb649e024 in nsAppShell::ProcessNextNativeEvent (this=0xaf633060, mayWait=true)
    at /builds/slave/m-aurora-lx-ntly-0000000000000/build/widget/gtk2/nsAppShell.cpp:135
#7  0xb64a4458 in nsBaseAppShell::DoProcessNextNativeEvent (this=0xaf633060, mayWait=true, 
    recursionDepth=0)
    at /builds/slave/m-aurora-lx-ntly-0000000000000/build/widget/xpwidgets/nsBaseAppShell.cpp:139
#8  0xb64a45f3 in nsBaseAppShell::OnProcessNextEvent (this=0xaf633060, thr=0xb728cb80, 
    mayWait=<optimized out>, recursionDepth=0)
    at /builds/slave/m-aurora-lx-ntly-0000000000000/build/widget/xpwidgets/nsBaseAppShell.cpp:298
#9  0xb659cbc5 in nsThread::ProcessNextEvent (this=0xb728cb80, mayWait=true, result=0xbfe7e86f)
    at /builds/slave/m-aurora-lx-ntly-0000000000000/build/xpcom/threads/nsThread.cpp:600
#10 0xb6571489 in NS_ProcessNextEvent_P (thread=0xb728cb80, mayWait=true)
    at /builds/slave/m-aurora-lx-ntly-0000000000000/build/obj-firefox/xpcom/build/nsThreadUtils.cpp:238
#11 0xb64e0f77 in mozilla::ipc::MessagePump::Run (this=0xb72e9be0, aDelegate=0xb724b8a0)
    at /builds/slave/m-aurora-lx-ntly-0000000000000/build/ipc/glue/MessagePump.cpp:117
#12 0xb65c2fdf in RunInternal (this=0xb724b8a0)
    at /builds/slave/m-aurora-lx-ntly-0000000000000/build/ipc/chromium/src/base/message_loop.cc:215
#13 RunHandler (this=0xb724b8a0)
    at /builds/slave/m-aurora-lx-ntly-0000000000000/build/ipc/chromium/src/base/message_loop.cc:208
#14 MessageLoop::Run (this=0xb724b8a0)
    at /builds/slave/m-aurora-lx-ntly-0000000000000/build/ipc/chromium/src/base/message_loop.cc:182
#15 0xb564acf6 in nsBaseAppShell::Run (this=0xaf633060)
    at /builds/slave/m-aurora-lx-ntly-0000000000000/build/widget/xpwidgets/nsBaseAppShell.cpp:163
#16 0xb555c23d in nsAppStartup::Run (this=0xb2bbb430)
    at /builds/slave/m-aurora-lx-ntly-0000000000000/build/toolkit/components/startup/nsAppStartup.cpp:288
#17 0xb50292c2 in XREMain::XRE_mainRun (this=0xbfe7eb14)
    at /builds/slave/m-aurora-lx-ntly-0000000000000/build/toolkit/xre/nsAppRunner.cpp:3871
#18 0xb502c895 in XREMain::XRE_main (this=0xbfe7eb14, argc=3, argv=0xbfe7fe24, aAppData=0xbfe7ec5c)
    at /builds/slave/m-aurora-lx-ntly-0000000000000/build/toolkit/xre/nsAppRunner.cpp:3938
#19 0xb502ca16 in XRE_main (argc=3, argv=0xbfe7fe24, aAppData=0xbfe7ec5c, aFlags=0)
    at /builds/slave/m-aurora-lx-ntly-0000000000000/build/toolkit/xre/nsAppRunner.cpp:4141
#20 0x0804fe0d in do_main (argc=3, argv=0xbfe7fe24, xreDirectory=0xb7229280)
    at /builds/slave/m-aurora-lx-ntly-0000000000000/build/browser/app/nsBrowserApp.cpp:224
#21 0x08050149 in main (argc=3, argv=0xbfe7fe24)
    at /builds/slave/m-aurora-lx-ntly-0000000000000/build/browser/app/nsBrowserApp.cpp:521

Last time it happened I was able to trigger a chrome hang by using cpulimit on the Firefox process: https://crash-stats.mozilla.com/report/index/bp-d86e680e-917e-4d98-8e36-abb002130326 I'm not sure if this helps, but this time I couldn't trigger it.

Is there any other information I could get to help pinpoint this issue? It happens rarely enough that running a binary search through my enabled addons would take half a year or so, if one of them is causing it. But it still happens often enough that I'd be able to get additional info within a couple of weeks, when it occurs again.
I may have hit the same problem in the last week as well. I think it is somehow related to using my slow 3G connection via tethering. If connecting or waiting for data times out, Firefox's CPU usage goes up to 100% and nothing except the UI works until a restart.
FTR, attaching a debugger did not help at all; I have no idea why Firefox consumed 100% CPU. The backtrace was unfortunately as uninteresting as the chrome hang profile mentioned in comment 0.
I've had the same issue with Nightly 23 on two different machines in the last week.
I just had Firefox stop loading webpages for me on OS X and netstat showed 525 established connections to EFF's SSL observatory server. I have the HTTPS Everywhere extension installed and apparently the observatory was turned on. I've now turned the observatory off to see if the problem goes away. I suspect I was hitting a connection limit.
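
For anyone else who wants to repeat the netstat check described above, here is a rough sketch of counting established connections per remote host. It assumes IPv4 rows in the usual `netstat -an` column layout (proto, recv-q, send-q, local, foreign, state); the function name and the sample addresses are made up for illustration, and real output may need extra handling (IPv6, different column orders).

```python
from collections import Counter


def count_established(netstat_output: str) -> Counter:
    """Count ESTABLISHED TCP connections per remote IPv4 address."""
    counts: Counter = Counter()
    for line in netstat_output.splitlines():
        fields = line.split()
        # Assumed row layout: proto recv-q send-q local foreign state
        if len(fields) < 6 or not fields[0].startswith("tcp"):
            continue
        if fields[5] != "ESTABLISHED":
            continue
        foreign = fields[4]
        # Linux prints addr:port, macOS/BSD prints addr.port (IPv4 only here).
        sep = ":" if ":" in foreign else "."
        host = foreign.rsplit(sep, 1)[0]
        counts[host] += 1
    return counts


if __name__ == "__main__":
    import subprocess

    # Runs the local netstat; prints the five busiest remote hosts.
    out = subprocess.run(["netstat", "-an"], capture_output=True, text=True).stdout
    for host, n in count_established(out).most_common(5):
        print(f"{n:5d}  {host}")
```

A single remote host showing hundreds of rows, as in the 525-connection case above, is the pattern to look for.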

The CPU usage can sometimes be caused by the loading throbber. I didn't pay attention to my CPU usage.
OS: Linux → All
Hardware: x86 → All
That sounds like a bug in the HTTPS Everywhere addon.  bsmith/patrick, do either of you have any contacts with the developers?

Note that network.http.max-connections is 256, so I'm not sure what mechanism the addon is using to get 525 connections (!) to the server.

I'd be curious to know if the EFF server is experiencing high connection load :)
(In reply to Jason Duell (:jduell) from comment #5)
> That sounds like a bug in the HTTPS Everywhere addon.  bsmith/patrick, do
> either of you have any contacts with the developers?

I CC'd Peter.
Hi.  We believe this is a bug in the SSL Observatory feature of HTTPS Everywhere that's triggered when that server is down or overloaded.  We're hoping to have a fix that's shippable over the weekend.
Sorry to hijack this bug, as I don't know whether merike, ttaubert or vladan use that extension, but is there something better necko can do when we hit the upper limit?
I'm not using this extension but can certainly check next time if I have many connections open and to where.
(In reply to Merike (:merike) from comment #9)
> I'm not using this extension but can certainly check next time if I have
> many connections open and to where.

Our code that triggers this bug uses XMLHttpRequests.  Although we do some privileged things with them (setting headers and munging proxies if the user has Tor available), they might not be semantically different from the XHRs that ordinary content script can generate (see
https://gitweb.torproject.org/https-everywhere.git/blob/refs/heads/stable:/src/components/ssl-observatory.js#l640 ), so I'd guess there's a chance that AJAX sites with overloaded servers could trigger the same phenomenon.
> Is there something better necko can do when we hit the upper limit?

We've talked about failing seemingly timed-out connections when we hit our connection limits.  We could also try to detect if a given host appears to be abusing XHR connections (we generally limit HTTP connections to 6 per host, but perhaps we're not enforcing that limit for XHR connections; not enforcing it is in general actually a good idea, since they may be lingering AJAX/Comet requests and we don't want to choke off parallelism).  And perhaps we're not even counting them toward the 256 http-max-connections limit? (I can't think of how else we're getting 500+ of them.)
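
To make the "fail seemingly timed-out connections when we hit the limit" idea concrete, here is a minimal sketch. It is not necko code: the class, its method names, and the 30-second stall timeout are all hypothetical, and only the 256 figure comes from this bug.

```python
from dataclasses import dataclass, field
from typing import Dict, Optional

MAX_CONNECTIONS = 256    # network.http.max-connections value quoted in this bug
STALL_TIMEOUT_S = 30.0   # hypothetical: idle time after which a connection looks timed out


@dataclass
class ConnectionTable:
    """Toy model of a global connection table that prunes stalled entries."""
    max_connections: int = MAX_CONNECTIONS
    stall_timeout: float = STALL_TIMEOUT_S
    last_activity: Dict[int, float] = field(default_factory=dict)
    next_id: int = 0

    def touch(self, conn_id: int, now: float) -> None:
        """Record that data moved on a connection."""
        self.last_activity[conn_id] = now

    def prune_stalled(self, now: float) -> int:
        """Drop connections with no activity for stall_timeout seconds."""
        stalled = [cid for cid, t in self.last_activity.items()
                   if now - t >= self.stall_timeout]
        for cid in stalled:
            del self.last_activity[cid]
        return len(stalled)

    def open(self, now: float) -> Optional[int]:
        """Open a connection; prune only when the global limit is reached."""
        if len(self.last_activity) >= self.max_connections:
            self.prune_stalled(now)
        if len(self.last_activity) >= self.max_connections:
            return None  # still full: the caller has to wait or fail
        conn_id = self.next_id
        self.next_id += 1
        self.last_activity[conn_id] = now
        return conn_id
```

Pruning only at the limit leaves long-lived Comet-style requests alone until they actually block new loads.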

The pruning connection stuff won't happen this quarter, but I could imagine a more targeted fix here that limits the # of XHR requests per host to something (<36, say, just to be conservatively in favor of XHR-happy sites that may be out there--it's the Internet :).
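
A per-host XHR cap along those lines could be sketched roughly as follows. This is illustrative only, not a proposed patch: the class and method names are hypothetical, and the default of 32 is just an arbitrary value under the "<36" figure above.

```python
from collections import defaultdict, deque
from typing import Any, Deque, Dict, Optional


class PerHostXhrLimiter:
    """Toy per-host cap on in-flight XHRs; excess requests are queued."""

    def __init__(self, per_host_limit: int = 32) -> None:  # hypothetical default
        self.per_host_limit = per_host_limit
        self.active: Dict[str, int] = defaultdict(int)
        self.pending: Dict[str, Deque[Any]] = defaultdict(deque)

    def submit(self, host: str, request: Any) -> bool:
        """Return True if dispatched now, False if queued behind the cap."""
        if self.active[host] < self.per_host_limit:
            self.active[host] += 1
            return True
        self.pending[host].append(request)
        return False

    def complete(self, host: str) -> Optional[Any]:
        """Release a slot; return the queued request that takes it, if any."""
        if self.active[host] == 0:
            raise ValueError(f"no active requests for {host}")
        self.active[host] -= 1
        if self.pending[host]:
            nxt = self.pending[host].popleft()
            self.active[host] += 1
            return nxt
        return None
```

Queueing instead of rejecting keeps XHR-happy sites working, just with less parallelism once they pass the cap.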

Patrick, do you have thoughts on this?  We can mark XHR requests with some new IDL method if we don't already have info on which HTTP requests are XHR.
Flags: needinfo?(mcmanus)
Limiting the scope of this bug to just what we might do in Firefox code to avoid the EFF Observatory problem (and related cases of XHR flooding).

I'd take a patch here, but I'm not sure it's worth expending a lot of resources, since I expect an HTTPS Everywhere addon fix and this hasn't come up a lot elsewhere.  There are a million ways sites/addons can DoS themselves...
Summary: Firefox stops loading web pages → Limit rogue sites/addons from flooding XHR connections
Peter: can you comment here when you've got an HTTPS Everywhere fix?  Thanks.
Component: General → Networking
Product: Firefox → Core
(In reply to Jason Duell (:jduell) from comment #13)
> Peter: can you comment here when you've got an HTTPS everywhere fix?  Thanks.

HTTPS Everywhere fixes just shipped in our stable (3.2.1) and development (4.0development.7) branches this afternoon:

https://gitweb.torproject.org/https-everywhere.git/commitdiff/2cca6790b997d30ce6ee333b1f9653168b8a86cd

https://gitweb.torproject.org/https-everywhere.git/commitdiff/33abb7f076c688f2bc7e3b9f2940f5a99b7d9de7

https://gitweb.torproject.org/https-everywhere.git/commitdiff/6e019d81ddf73449198799968e2f88946a86413c
(In reply to Jason Duell (:jduell) from comment #11)

> host, but perhaps we're not enforcing XHR connections as part of that
> limit-

XHR counts towards the quota. So the issue is something more subtle.
Flags: needinfo?(mcmanus)
FYI, I updated HTTPS-Everywhere to 3.2.1 today, and I still ran into timeout issues with Aurora.  I disabled SSL Observatory and that seemed to bring the browser back to life.
(In reply to Amy Rich [:arich] [:arr] from comment #16)
> FYI, I updated HTTPS-Everywhere to 3.2.1 today, and I still ran into timeout
> issues with Aurora.  I disabled SSL Observatory and that seemed to bring the
> browser back to life.

Hi Amy, any chance that browser session is still alive and you can paste any HTTPS Everywhere messages from your Error Console -> Messages tab?
Unfortunately I've had to restart Aurora (for other reasons) twice since then.
I haven't been able to reproduce this on Aurora (or Nightly). Amy, if you find the time to try to reproduce and can give us console logs that'd be great. If you edit the about:config setting and change extensions.https_everywhere.LogLevel to 1, that will give more verbose messages which are helpful to us. You can then restrict via the search function to just messages related to the observatory. Thanks!
(In reply to Dan Auerbach from comment #19)
> I haven't been able to reproduce this on Aurora (or Nightly). Amy, if you
> find the time to try to reproduce and can give us console logs that'd be
> great. If you edit the about:config setting and change
> extensions.https_everywhere.LogLevel to 1, 

Actually, I'd recommend loglevel 2.  Level 1 is extremely spammy.
I actually ran into the issue of Aurora hanging again, even with SSL Observatory disabled.  I took a look at the error console, and there was nothing revealing.  Taking a look at info messages just showed normal http requests happening veeeeery slowly and occasionally (nowhere near as many as there should have been).  I also checked netstat to verify that it wasn't opening up a ton of connections to the EFF anyway (it wasn't).

I've now disabled HTTPS-Everywhere altogether to see if that solves the issue.  So far it's been running for almost a day (which is a record) without locking up.
The good/bad news is that I haven't experienced any more hangs since disabling HTTPS-Everywhere entirely.
(In reply to Amy Rich [:arich] [:arr] from comment #22)
> The good/bad news is that I haven't experienced any more hangs since
> disabling HTTPS-Everywhere entirely.

Maybe next week you could reenable HTTPS-E (without the Observatory) and see if the problems come back?  If the loglevel verbosity is set to 3 or 4 and you spool the output of Aurora to a file, we might be able to get some further clues if it hangs.

Note also that some kernel tweaking we did on Tuesday solved the server-side problems with the Observatory, so I expect the socket exhaustion problems to be non-reproducible at this point even if our attempts to squash them in the client didn't work.  Under heavy load the server's kernel was dropping a lot of inbound packets, which was leaving a lot of TCP connections in weird unexpected dangling states.
I don't think there is anything to be done at the necko layer here.
Status: NEW → RESOLVED
Closed: 8 years ago
Resolution: --- → WONTFIX