Closed Bug 525284 Opened 10 years ago Closed 5 years ago

When cancelling a HTTP transaction, its half-open sockets should be abandoned

Categories

(Core :: Networking: HTTP, defect)


Tracking


RESOLVED FIXED
mozilla34

People

(Reporter: Natch, Assigned: ttaubert)

References

Details

Attachments

(2 files)

When you run mochitests like this:

TEST_PATH=browser/components/safebrowsing make -C $OBJDIR mochitest-browser-chrome

browser_bug400731.js will fail, since the actual "fake" attack page will load. My thinking is that the DB doesn't have a chance to update. If run as part of the whole browser/components directory it will pass. Is there no better way to do this? I.e., write to the phishing DB, create your own local page, and navigate to that...
This still happens; the actual attack site ("it's an attack") loads.
Hey, it has been a few more years and I am running into this as well while running the tests as a standalone directory. In fact, I am looking to do that for all browser-chrome tests in bug 992911.

This seems to be limited to Windows debug (though I recall seeing it on OS X as well):
https://tbpl.mozilla.org/?tree=Try&rev=07d9eef3e4f0

Here is a link to a Win7 log:
https://tbpl.mozilla.org/php/getParsedLog.php?id=38287837&tree=Try

We are leaking a docShell until shutdown.

:johnath, you were the original author of this test; any thoughts on why it fails when run as only the safebrowsing subdirectory (effectively a single test)?
Blocks: run-by-dir
Flags: needinfo?(johnath)
I don't know the current state of the safebrowsing code, but I suspect the explanation goes:

* this test checks to ensure we get our warning interstitial pages when navigating to our beacon sites - reference sites guaranteed to be in the safebrowsing databases
* BUT tests run in a fresh profile and we don't ship a pre-populated safebrowsing DB since it would instantly be out of date
* So this test is, it seems, a race. During a normal run we have time to start fetching safebrowsing DBs and the test passes, but when running them in the directory alone, I'm guessing we (often) don't have time to perform the fetch, and so the beacon sites aren't yet blocked.

The most direct solution would be for the test to programmatically add the entries for the beacon sites to the DB before trying to navigate to them. That would let this test pass even when run in isolation. It would have the effect, though, of no longer exercising the background fetching from safebrowsing, so someone closer to the code would have to ensure that was tested elsewhere. Another option would be to block on some indication that the first background fetch had succeeded, but that might get into timeouts and generally be a real nuisance for people running small-subset testing.
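The second option above (blocking on some indication that the first fetch succeeded, with a timeout) could be sketched roughly like this. Everything here is hypothetical: `waitFor` is a generic poll-with-timeout helper, not a real Gecko or test-harness API, and the `isDbReady` check stands in for however the test would actually query the safebrowsing DB.

```javascript
// Hypothetical sketch: poll a condition (e.g. "safebrowsing DB populated")
// until it holds or a timeout elapses, instead of navigating immediately.
function waitFor(condition, { intervalMs = 100, timeoutMs = 5000 } = {}) {
  return new Promise((resolve, reject) => {
    const started = Date.now();
    (function poll() {
      if (condition()) return resolve();
      if (Date.now() - started >= timeoutMs)
        return reject(new Error("timed out waiting for condition"));
      setTimeout(poll, intervalMs);
    })();
  });
}
```

A test would then do something like `await waitFor(isDbReady)` before navigating to the beacon site, failing cleanly on timeout rather than loading the attack page.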

Adding gcp, with my apologies for writing something racy in his module.
Flags: needinfo?(johnath)
Thanks johnath for the explanation here.  

This got me thinking a bit more. We do the best we can to limit outside network traffic; most of this is done with prefs, and we do some safebrowsing pref setting in automation (http://dxr.mozilla.org/mozilla-central/source/testing/profiles/prefs_general.js):
user_pref("browser.safebrowsing.gethashURL", "http://%(server)s/safebrowsing-dummy/gethash");
user_pref("browser.safebrowsing.updateURL", "http://%(server)s/safebrowsing-dummy/update");

%(server)s is replaced with mochi.test:8888, and I don't believe that resolves to anything (possibly a 404). Maybe gcp could help us ensure we have the right preloading, prefs, fake server, etc. set up for testing. Could it be that setting these prefs still puts something in the DB and we just need to wait for it to happen? In that case it would make sense to wait for an event and then perform the actions.
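For illustration, the %(server)s expansion described above can be sketched as a simple string substitution. The function name is made up; the real harness performs this replacement when writing out the prefs file, and "mochi.test:8888" is the value mentioned above.

```javascript
// Hypothetical sketch of the %(server)s placeholder expansion done when
// prefs_general.js is materialized for an automation run.
function expandServerPlaceholder(prefValue, server = "mochi.test:8888") {
  return prefValue.replace(/%\(server\)s/g, server);
}
```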
SafeBrowsing does not, in fact, perform any (outside) network access during automated testing. The database is updated locally with test URLs without involving Google's server. The fact that we don't test against Google's server has led to some very severe stop-the-press bugs, which we're only now starting to address with bug 967568.

The entries are added locally on startup here:
http://mxr.mozilla.org/mozilla-central/source/toolkit/components/url-classifier/SafeBrowsing.jsm#68

The problem is that updating that DB requires disk IO. Disk IO on startup is...bad: https://bugzilla.mozilla.org/show_bug.cgi?id=778855

As a result, the SafeBrowsing module currently forces a delay of 2 seconds after its own initialization before it inserts the test URLs into the DB: http://mxr.mozilla.org/mozilla-central/source/browser/base/content/browser.js#1022

I'm going to guess that running the tests without accounting for this delay is causing the problem you're seeing.
gcp: I wonder if we could have tests for this in mozmill to account for live network access?

Is there an event I could listen for (i.e., mozaftersafebrowsingupdated)?
>gcp: I wonder if we could have tests for this in mozmill to account for live network access?

That's what the linked bug 967568 is about.

>is there an event I could listen on (i.e. mozaftersafebrowsingupdated)?

No.
:gcp, I have verified on the try server that adding a 5-second delay before starting the test works. I know this is ugly; maybe you have some other advice on how to do this smarter? Could I query a database for values and fail if none show up within 5 seconds? Wait for other events?
Attachment #8411660 - Flags: feedback?(gpascutto)
Comment on attachment 8411660 [details] [diff] [review]
ugly hack to make this work via settimeout

Review of attachment 8411660 [details] [diff] [review]:
-----------------------------------------------------------------

>could I query a database for values and if none show up within 5 seconds fail?  wait for other events?

Querying a database would work, but then you're testing something else entirely, which seems pointless. I think we could add an event for the update; feel free to file a bug for that.

That being said, the xpcshell tests use delays of 3 seconds in various places, probably for the same reasons. So I wouldn't worry too much about it; maybe reduce the delay to 3 if that seems to work. I would guess that as long as it's >2s you'd be fine.
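A minimal sketch of the delay workaround under discussion, using the 3-second value suggested here; `delayedStart` and `runTest` are hypothetical names, not part of the mochitest harness:

```javascript
// Hypothetical sketch: delay test startup until after SafeBrowsing's
// 2-second post-init insertion of the test URLs has had time to run.
const SAFEBROWSING_INIT_DELAY_MS = 3000; // >2s, per the review comment

function delayedStart(runTest, delayMs = SAFEBROWSING_INIT_DELAY_MS) {
  return new Promise(resolve =>
    setTimeout(() => resolve(runTest()), delayMs)
  );
}
```

This is exactly the kind of timing-based hack the thread calls ugly: it trades a race for a fixed delay, which is why an explicit update event would be preferable.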
Attachment #8411660 - Flags: feedback?(gpascutto) → feedback+
Hmm: while this appears to fix the problem, it doesn't in all cases. On Windows XP Debug, we still leak a docShell until shutdown with the 5-second delay:
https://tbpl.mozilla.org/php/getParsedLog.php?id=38479634&tree=Try

Open to any thoughts; maybe there is something I could clean up.
Depends on: 1007687
Product: Firefox → Toolkit
This test is still failing for the OS X and Windows debug bc1 jobs, as can be seen in a recent try push: https://tbpl.mozilla.org/?tree=Try&rev=4d3f23cc8c3c

The goal is to turn these on as --run-by-dir in 2 weeks. We don't have too many tests that are problematic (19 total); if we can fix them now, that will help! :)

Gian-Carlo, can you figure out why this test might be leaking when running tests only in the directory browser/components/safebrowsing/content/test?
Flags: needinfo?(gpascutto)
I don't know offhand what's up aside from what I already said in the comments above.
Flags: needinfo?(gpascutto)
I have not had success with the notifications from bug 1007687.  My initial tests had the test disabled in another patch.

Testing this in more detail, I have found that we always have SafeBrowsing initialized (we never wait for observers).

So given that fact we still have failures:
https://tbpl.mozilla.org/?tree=Try&rev=9e0dc5d6515a (win7 and win8)

I thought, why not wait 3 seconds before starting the test (as mentioned above):
https://tbpl.mozilla.org/?tree=Try&rev=78c857cae3db

Doing that yields WinXP permanent failures.

I guess the question is: what can we do to figure out what is missing? Can we do something like this:
* add observers
* for a db update
* wait for observer to be notified
* run test

That would ensure the database is updated, but I am not sure that is what is causing this failure.
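The observer-based flow in the steps above could look roughly like this. This is a generic model, not Gecko code (the real implementation would go through Services.obs / nsIObserverService), and the topic name "safebrowsing-update-finished" is hypothetical:

```javascript
// Hypothetical minimal observer registry modeling the proposed flow:
// add an observer for a DB update, wait for the notification, run the test.
const observers = new Map();

function addObserver(topic, fn) {
  if (!observers.has(topic)) observers.set(topic, []);
  observers.get(topic).push(fn);
}

function notify(topic, data) {
  (observers.get(topic) || []).forEach(fn => fn(data));
}

// Resolves once the topic fires, so a test can `await` the update.
function waitForNotification(topic) {
  return new Promise(resolve => addObserver(topic, resolve));
}
```

A test would `await waitForNotification("safebrowsing-update-finished")` before navigating, instead of sleeping for a fixed number of seconds.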

:gcp, any further thoughts here?
Flags: needinfo?(jmaher)
Heh, I am awesome at this Bugzilla tool; I ni?'d myself!
Flags: needinfo?(jmaher) → needinfo?(gpascutto)
Unless I'm misunderstanding, those logs show the SafeBrowsing tests themselves are passing, so I'm not sure your docShell issue is because the updates don't work.
Flags: needinfo?(gpascutto)
The tests pass, but the problem is that we leak a docShell until shutdown (see comment 2). I really don't know what to do here, but I am willing to keep hacking if given some direction.
Flags: needinfo?(gpascutto)
I'm not sure why you're asking me for direction :-) SafeBrowsing works. The problem is elsewhere.

I started looking at this (with the big disclaimer I know absolutely nothing about it) and as far as I can tell comment 2 and beyond are probably totally off in the wrong direction. The head.js file for these tests forces a SafeBrowsing.init(), which means the case from comment 0 and comment 1 can't exist because the test itself forces SafeBrowsing to initialize before starting. Looking at the hg blame for that file, ehsan fixed bug 800394 to do exactly this (and at that point this bug should've been closed as well). That's also what you observed in comment 13.

That still leaves the question why the test occasionally leaks a docShell on Windows if you run it in the specific way you indicate. I've looked at the test source, but I don't see any *obvious* ways in which it fails to close a tab or remove a listener.

Is this reproducible locally for you? Maybe you can physically see what's wrong?
Flags: needinfo?(gpascutto)
I have a loaner Win7 machine and I haven't been able to reproduce this on it at all (running it manually, via mozharness, etc.). The problem is that this reproduces so easily on the try server.

The issue is that we leak a docShell window 100% of the time while running per directory.
gcp, I have tried hacking around in the test case; honestly, I am not sure what to do next here. It sounds like we are back to square one with this test case.

should we adjust our listeners for the other two tests like we do here:
http://hg.mozilla.org/mozilla-central/rev/dd2ae94d1e01
Flags: needinfo?(gpascutto)
(In reply to Joel Maher (:jmaher) from comment #19)
> should we adjust our listeners for the other two tests like we do here:
> http://hg.mozilla.org/mozilla-central/rev/dd2ae94d1e01

No, that's specific to dealing with the about:blank page that may or may not have finished loading before the test gets to set the test page (see the comment in the JS file). I don't see any way that the other tests could require it.
Flags: needinfo?(gpascutto)
Thanks for all your replies, gcp! With no progress on actually fixing this bug and nothing left to try, I am tempted to disable it for OS X and Windows debug. This is way out of my knowledge domain; please let me know if there is something else to try, or your thoughts on disabling this test.
Maybe Tim can help? :)
Flags: needinfo?(ttaubert)
When an HTTP request is canceled by the channel classifier because we consider the URL malware, we also cancel the HTTP transaction but do not remove any half-open sockets belonging to that transaction. This seems to be what causes the permanent docShell leaks we're seeing here; abandoning those sockets fixes them locally for me.

I'm not sure whether it's fine to always abandon half-open sockets when a transaction is canceled, or whether we should only do that for blocked requests.
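As an illustration of the fix (the actual patch is C++ in nsHttpConnectionMgr.cpp), here is a hypothetical JavaScript model of the cleanup: when a transaction is canceled, walk the connection entry's half-open sockets and abandon any that belong to it. The class and field names are made up for this sketch.

```javascript
// Hypothetical model of a connection entry holding half-open sockets.
// On transaction cancel, abandon (and drop) the half-opens it owns.
class ConnectionEntry {
  constructor() {
    this.halfOpens = [];
  }
  addHalfOpen(halfOpen) {
    this.halfOpens.push(halfOpen);
  }
  cancelTransaction(trans) {
    // Iterate backwards so removing entries mid-loop is safe.
    for (let i = this.halfOpens.length - 1; i >= 0; i--) {
      if (this.halfOpens[i].transaction === trans) {
        this.halfOpens[i].abandoned = true;
        this.halfOpens.splice(i, 1);
      }
    }
  }
}
```

Without this cleanup, a canceled transaction's half-open sockets linger, keeping references alive, which matches the "leaked docShell until shutdown" symptom described above.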
Attachment #8481765 - Flags: review?(mcmanus)
Attachment #8481765 - Flags: review?(hurley)
Flags: needinfo?(ttaubert)
Summary: browser_bug400731.js is fragile, not always passing. → Permanent "browser_bug400731.js | leaked 1 docShell(s) until shutdown" when running tests only in browser/components/safebrowsing/content/test/
Comment on attachment 8481765 [details] [diff] [review]
0001-Bug-525284-When-cancelling-a-HTTP-transaction-abando.patch

Review of attachment 8481765 [details] [diff] [review]:
-----------------------------------------------------------------

Thanks Tim,

This would probably eventually time out rather than leak, but your patch is the intended behavior.

::: netwerk/protocol/http/nsHttpConnectionMgr.cpp
@@ +2299,5 @@
>              }
>          }
> +
> +        // Abandon all half-open sockets belonging to the given transaction.
> +        for (uint32_t index = 0;

Can you just move this up inside the if (ent) block above and make it the else {} clause of the existing if (index >= 0) bit? They should be mutually exclusive.
Attachment #8481765 - Flags: review?(mcmanus) → review+
(In reply to AFK Aug 22- Sep 01 Patrick McManus [:mcmanus] from comment #24)
> This would probably actually eventually time out and not leak, but your
> patch is the intended behavior.

That probably explains why we're only seeing these leaks in the run-by-dir case, where the browser instance is much shorter-lived than it is now.

Thanks for the investigative work! Run-by-dir is almost ready to land :)
(In reply to AFK Aug 22- Sep 01 Patrick McManus [:mcmanus] from comment #24)
> can you just move this up inside the if (ent) block above and make it the
> else{} clause to the existing if (index >= 0) bit.. they should be mutually
> exclusive.

They don't seem to be mutually exclusive; I still hit the leak when making it the else clause. I will move it inside |if (ent)| and make it iterate regardless of the index.

Thanks for the fast review!
Attachment #8481765 - Flags: review?(hurley)
Assignee: nobody → ttaubert
Status: NEW → ASSIGNED
Component: Phishing Protection → Networking: HTTP
Product: Toolkit → Core
Summary: Permanent "browser_bug400731.js | leaked 1 docShell(s) until shutdown" when running tests only in browser/components/safebrowsing/content/test/ → When cancelling a HTTP transaction, its half-open sockets should be abandoned
https://hg.mozilla.org/mozilla-central/rev/30372695cda3
Status: ASSIGNED → RESOLVED
Closed: 5 years ago
Resolution: --- → FIXED
Target Milestone: --- → mozilla34
Depends on: 1104993