Last Comment Bug 628561 - DNS caching forcing one to restart Firefox to eliminate temporary DNS redirection
: DNS caching forcing one to restart Firefox to eliminate temporary DNS redirec...
Status: RESOLVED FIXED
[inbound]
:
Product: Core
Classification: Components
Component: Networking (show other bugs)
: unspecified
: x86 Linux
: -- normal (vote)
: mozilla8
Assigned To: Patrick McManus [:mcmanus]
:
Mentors:
Depends on: 650282 701362
Blocks: 648941
  Show dependency treegraph
 
Reported: 2011-01-24 18:47 PST by xiaqqaix
Modified: 2016-04-05 14:27 PDT (History)
11 users (show)
See Also:
Crash Signature:
(edit)
QA Whiteboard:
Iteration: ---
Points: ---
Has Regression Range: ---
Has STR: ---


Attachments
patch v1 (5.80 KB, patch)
2011-01-26 14:38 PST, Patrick McManus [:mcmanus]
bzbarsky: review+
Details | Diff | Splinter Review
patch v2; r=bz (5.79 KB, patch)
2011-01-27 06:34 PST, Patrick McManus [:mcmanus]
mcmanus: review+
Details | Diff | Splinter Review
patch v3 (7.50 KB, patch)
2011-07-21 14:02 PDT, Patrick McManus [:mcmanus]
bzbarsky: review+
Details | Diff | Splinter Review

Description xiaqqaix 2011-01-24 18:47:12 PST
User-Agent:       Mozilla/5.0 (X11; Linux i686; rv:2.0b10pre) Gecko/20110120 Firefox-4.0/4.0b10pre
Build Identifier: Mozilla/5.0 (X11; Linux i686; rv:2.0b10pre) Gecko/20110120 Firefox-4.0/4.0b10pre

Some public networks require user name and password for access, and they do this by hijacking the DNS, that is, any DNS requests would be redirected to the login web page. After logging in Internet access is granted, and the redirection is no longer necessary. However, Firefox caches the old DNS redirection, and the misbehavior requires a restart of Firefox to fix.

Although I'm talking about Firefox, I believe that the bug should belong to "Core", as I found the Firefox section doesn't have a component "Networking: Cache" or like but this section has.


Reproducible: Always

Steps to Reproduce:
1. Connect to some public network described above.
2. Navigate Firefox to a certain URL, say, "http://www.google.com"
3. Firefox would be redirected to the login page.
4. Enter the correct user name and password to log in.
5. Navigate Firefox to the same URL as in step 2.
Actual Results:  
Firefox would be redirected to the login page again.

Expected Results:  
Firefox should go to the un-redirected place.

At least a force reload should lead Firefox to the un-redirected place.

Only the URL in step 2 gets incorrectly redirected. Any other URL works.

A "force reload" doesn't help. I have tried (Ctrl|Shift|Alt) + (Reload button|F5); none works. Seems the force reload doesn't reload the DNS cache.

This could lead to substantial security problems, if the public network is malicious.
Comment 1 Boris Zbarsky [:bz] 2011-01-24 20:19:23 PST
"Cache" is the HTTP cache.

How could this lead to security problems that aren't present without DNS caching, exactly?
Comment 2 xiaqqaix 2011-01-24 21:01:24 PST
(In reply to comment #1)
> "Cache" is the HTTP cache.
> 
> How could this lead to security problems that aren't present without DNS
> caching, exactly?

Thanks for the prompt reply. After some rethink the security concern turned out to be a false perception. As for the category, it would be appreciated if anyone with the permission could move this report to the more appropriate section.
Comment 3 Boris Zbarsky [:bz] 2011-01-24 21:05:13 PST
I already did the moving.  ;)

Thanks for clarifying about the security concerns!
Comment 4 Patrick McManus [:mcmanus] 2011-01-25 13:04:04 PST
the force reload does force the dns subsystem to refresh the dns cache: 
http://mxr.mozilla.org/mozilla-central/source/netwerk/protocol/http/nsHttpChannel.cpp#3632

unfortunately, what I think is going on here is that your reload is using an idle persistent connection which is indexed under the host name.. so no call to the dns system is made (and thus no refresh is done). I was able to reproduce that.

The best thing I can think of to do is when we get a force reload and are about to reuse an idle persistent connection to check that the DNS matches the peer address and if doesn't (which 99.9% of the time is going to be because of a legit LB thing not the "hijack is repaired" scenario here) then update the DNS cache and close out all the existing pconns so as to use the new server. It will slow things down un-necessarily in some cases, but verifiable correctness is sort of the point of forced reload.
Comment 5 Patrick McManus [:mcmanus] 2011-01-26 14:38:23 PST
Created attachment 507259 [details] [diff] [review]
patch v1

After sleeping on this, I decided to take a more direct approach. Forced reloads (ctrl f5, shift reload icon, or shift ctrl r) are very strong signals that we want to resync things from scratch.

so this patch responds to a forced reload by closing out the persistent connection pool for the related host. new connections will be opened and new DNS lookups will happen. There is a performance penality - but more or less that is what you mean when you do the forced reload

normal reloads are un-effected.

this is a long standing issue, we don't necessarily have to add the fix to necko 2.0 imo
Comment 6 Boris Zbarsky [:bz] 2011-01-26 19:11:53 PST
Comment on attachment 507259 [details] [diff] [review]
patch v1

You don't need the "see bug" bit; the vcs blame will show that.  r=me with that line gone.
Comment 7 Patrick McManus [:mcmanus] 2011-01-27 06:34:57 PST
Created attachment 507441 [details] [diff] [review]
patch v2; r=bz

reflects comment 6.

will land post ff 4.0 barring other instruction.
Comment 8 Patrick McManus [:mcmanus] 2011-01-27 06:47:35 PST
(In reply to comment #6)

> 
> You don't need the "see bug" bit; the vcs blame will show that.

I made the change to the patch, but I wanted to follow up on this.

over 1100 of the 1233 lines of nshttptransaction.cpp are uselessly blamed on change d8dc49d5bd609668b3c4fadd6c1df12d5da20547 (rearrange makefiles and directory heirarchy). Anecdotally, that's also true of at least most of the files in netwerk/protocol/http.

changes to whitespace and minor changes in logic (e.g adding an ||) also obscure blame history. Blame is fragile.

The source should be self describing wrt motivation, but pointers to further reading are a good thing especially to stable sources like bugzilla.

But hey, I cringe at the 80 column rule too :) I can cope.
Comment 9 Boris Zbarsky [:bz] 2011-01-27 08:57:29 PST
The blame mess with necko is a problem, yes; we have people working on hg to fix that.  ;)

And yes, sometimes you have to follow blame a few changes back in time.

But if we added bug pointers for every checkin, we'd right now have several hundred thousand of them in our codebase.  That's one every 10 lines of code, on average.  That doesn't really scale, unfortunately.
Comment 10 Patrick McManus [:mcmanus] 2011-03-24 11:01:10 PDT
http://hg.mozilla.org/mozilla-central/rev/d5fad1ab2f00
Comment 11 Joe Drew (not getting mail) 2011-03-24 17:34:10 PDT
Backed out on suspicion of causing tp4 regression: http://hg.mozilla.org/mozilla-central/rev/b17599b51fd7
Comment 12 Patrick McManus [:mcmanus] 2011-04-03 08:46:43 PDT
http://perf.snarkfest.net/compare-talos/breakdown.html?oldTestIds=6864069,6864861,6866327,6867825&newTestIds=6869662,6869775&testName=tp4

this is the source of the tp4 issue.. it does clear the cache, but I didn't think any of the tests were using external resources?
Comment 13 Boris Zbarsky [:bz] 2011-04-04 16:35:40 PDT
Also, tp4 shouldn't be doing force-reloads...
Comment 14 Kelley Cook 2011-05-19 13:07:22 PDT
Aren't the post-patch results of this tp4 "regression" to be expected?  Some of those pages have multiple persistent connections and the test purposefully resets each of them 10 times.  Now, unlike before, they are actually being completely reset, as one would intuitively expect -- see Comment 5.  Pre-patch all those connections would have remained open though it was reloading each of them.

I'd vote for fixing the test either by not having it force-reload between each of 10 page refreshes ... or just accepting the bad looking, but more accurate results.

Ctrl-F5 is a supposed to be a full reset, so it should be somewhat slow, otherwise its just gaming the system.
Comment 15 Boris Zbarsky [:bz] 2011-05-19 13:11:20 PDT
> and the test purposefully resets each of them 10 times.

Where do you see that happening, exactly?

> Ctrl-F5 is a supposed to be a full reset

tp4 does NOT do that.  It just loads web pages the same as you would by typing in the url bar and hitting enter.
Comment 16 Patrick McManus [:mcmanus] 2011-05-19 13:24:10 PDT
(In reply to comment #15)
> > and the test purposefully resets each of them 10 times.
> 
> Where do you see that happening, exactly?
> 
> > Ctrl-F5 is a supposed to be a full reset
> 
> tp4 does NOT do that.  It just loads web pages the same as you would by
> typing in the url bar and hitting enter.

First, Kelley - thanks for taking the time to think about this.. I haven't forgotten about it but it is sitting behind some other higher priority things.

Boris, I did get as far as logging on the code that is supposed to match up to the ^F5 and it did get hit when running tp4. I did not get a chance to dig into why the heck that was happening - that will be my next step. 

and lastly, when I fixed a bug in syn-retry a couple weeks ago a bunch of talos numbers on the win-7 platform improved quite unexpectedly.. it stands to reason that this patch, which can result in having to make a bunch of new connections, would have run into that problem disproportionately on that platform.. if so, its possible that this can just land because the other bug is fixed. but that's just conjecture at this point until I have a chance to run it down.
Comment 17 Boris Zbarsky [:bz] 2011-05-19 13:36:41 PDT
> and it did get hit when running tp4

Er... why?  That seems wrong; what did the stack look like?
Comment 18 Patrick McManus [:mcmanus] 2011-05-19 13:42:30 PDT
(In reply to comment #17)
> > and it did get hit when running tp4
> 
> Er... why?  That seems wrong; what did the stack look like?

as I said, digging into it is in my queue somewhere. I don't think I even got a stack - just a log line which I didn't expect to be in the output :) I only mentioned it now because this is obviously tieing up space in kelley's brain :)
Comment 19 Patrick McManus [:mcmanus] 2011-07-21 09:57:16 PDT
(In reply to comment #13)
> Also, tp4 shouldn't be doing force-reloads...

LOAD_BYPASS_CACHE is used by force-reload.. it turns out in the case of tp4, it is also used by an xhr post of some sort.

I'll update the patch to use a new LOAD_FRESH_CONNECTION and have docshell set LOAD_BYPASS_CACHE | LOAD_FRESH_CONNECTION on the force-reload.
Comment 20 Patrick McManus [:mcmanus] 2011-07-21 14:02:08 PDT
Created attachment 547506 [details] [diff] [review]
patch v3

the tp4 (still available on try) regression clears up with this version. Makes more sense than overloading bypass_cache imo.

The words captive portal should be in this bug for search engine purpose.
Comment 21 Boris Zbarsky [:bz] 2011-07-21 14:10:22 PDT
Hmm.  Would it make more sense for XHR to use BYPASS_LOCAL_CACHE instead of BYPASS_CACHE in the POST case?  What it's doing right now is making sure to read from the server but still storing the data in our cache...  Jonas?
Comment 22 Boris Zbarsky [:bz] 2011-07-21 14:14:06 PDT
Comment on attachment 547506 [details] [diff] [review]
patch v3

That said, I think I agree that we may want to have a separate flag for this, because overloading LOAD_BYPASS_CACHE is not great.  But should we perhaps leave it out of LOAD_REQUESTMASK?  Otherwise each subresource on a page being shift-reloaded will end up closing idle connections, right?
Comment 23 Boris Zbarsky [:bz] 2011-07-21 14:14:37 PDT
And I'm also a little bit worried about running out of nsIRequest flag bits, but not sure what to do about it yet.
Comment 24 Jonas Sicking (:sicking) PTO Until July 5th 2011-07-21 14:41:23 PDT
Sorry, don't know enough about the issues here to have an immediate opinion. Most of what I know is based on comments in the XHR code.
Comment 25 Boris Zbarsky [:bz] 2011-07-21 18:56:40 PDT
The basic issue is whether we want XHR POST to bypass the cache only in terms of not reading from it (but still write to it) or whether it should skip the cache altogether.  But that's fodder for as separate bug anyway....
Comment 26 Patrick McManus [:mcmanus] 2011-07-22 05:52:27 PDT
(In reply to comment #22)
> Comment on attachment 547506 [details] [diff] [review] [review]
> patch v3
> 
> That said, I think I agree that we may want to have a separate flag for
> this, because overloading LOAD_BYPASS_CACHE is not great.  But should we
> perhaps leave it out of LOAD_REQUESTMASK?  Otherwise each subresource on a
> page being shift-reloaded will end up closing idle connections, right?

The closing of connections is done on a per-host basis and the subresources don't have to share the same host, so I think we're doing the right thing. It does mean that every transaction in the load group will get its own connection - but for the extremis of force-reload I don't think that's wrong.

It would be kind of odd to have existing connections to subresources that are out of sync because of a captive portal - but its not impossible in mobile scenarios.
Comment 27 Patrick McManus [:mcmanus] 2011-07-22 05:56:33 PDT
(In reply to comment #23)
> And I'm also a little bit worried about running out of nsIRequest flag bits,
> but not sure what to do about it yet.

perhaps bits 1 through 6 can be reclaimed?
Comment 28 Boris Zbarsky [:bz] 2011-07-22 07:21:10 PDT
Oh, I missed that 1-6 are unused.  OK, good.
Comment 29 Boris Zbarsky [:bz] 2011-07-22 08:32:26 PDT
Comment on attachment 547506 [details] [diff] [review]
patch v3

r=me
Comment 30 Marco Bonardo [::mak] (Away 6-20 Aug) 2011-07-25 05:48:21 PDT
http://hg.mozilla.org/mozilla-central/rev/9ff13e309d9e

Note You need to log in before you can comment on or make changes to this bug.