Closed Bug 628561 — opened 14 years ago, closed 13 years ago

DNS caching forcing one to restart Firefox to eliminate temporary DNS redirection

Categories

(Core :: Networking, defect)

Platform: x86 Linux
Priority: Not set
Severity: normal

Tracking


RESOLVED FIXED
mozilla8

People

(Reporter: xiaqqaix, Assigned: mcmanus)

References

Details

(Whiteboard: [inbound])

Attachments

(1 file, 2 obsolete files)

User-Agent: Mozilla/5.0 (X11; Linux i686; rv:2.0b10pre) Gecko/20110120 Firefox-4.0/4.0b10pre
Build Identifier: Mozilla/5.0 (X11; Linux i686; rv:2.0b10pre) Gecko/20110120 Firefox-4.0/4.0b10pre

Some public networks require a user name and password for access, and they enforce this by hijacking DNS: any DNS request is redirected to the login web page. After logging in, Internet access is granted and the redirection is no longer necessary. However, Firefox caches the old DNS redirection, and a restart of Firefox is required to clear the misbehavior. Although I'm describing Firefox, I believe the bug belongs to "Core", since the Firefox product has no "Networking: Cache" component (or similar) but this one does.

Reproducible: Always

Steps to Reproduce:
1. Connect to a public network as described above.
2. Navigate Firefox to a certain URL, say, "http://www.google.com".
3. Firefox is redirected to the login page.
4. Enter the correct user name and password to log in.
5. Navigate Firefox to the same URL as in step 2.

Actual Results: Firefox is redirected to the login page again.

Expected Results: Firefox should go to the un-redirected destination. At the very least, a force reload should take Firefox there.

Only the URL in step 2 gets incorrectly redirected; any other URL works. A "force reload" doesn't help: I have tried (Ctrl|Shift|Alt) + (Reload button|F5), and none works. It seems the force reload doesn't refresh the DNS cache. This could lead to substantial security problems if the public network is malicious.
"Cache" is the HTTP cache. How could this lead to security problems that aren't present without DNS caching, exactly?
Component: Networking: Cache → Networking
QA Contact: networking.cache → networking
(In reply to comment #1)
> "Cache" is the HTTP cache.
> How could this lead to security problems that aren't present without DNS caching, exactly?

Thanks for the prompt reply. After some rethinking, the security concern turned out to be a false perception. As for the category, it would be appreciated if anyone with the necessary permission could move this report to the more appropriate component.
I already did the moving. ;) Thanks for clarifying about the security concerns!
Group: core-security
The force reload does force the DNS subsystem to refresh the DNS cache: http://mxr.mozilla.org/mozilla-central/source/netwerk/protocol/http/nsHttpChannel.cpp#3632

Unfortunately, what I think is going on here is that your reload is using an idle persistent connection, which is indexed under the host name, so no call to the DNS system is made (and thus no refresh happens). I was able to reproduce that.

The best thing I can think of to do: when we get a force reload and are about to reuse an idle persistent connection, check that the DNS result matches the peer address, and if it doesn't (which 99.9% of the time is going to be a legitimate load-balancer change, not the "hijack is repaired" scenario here), update the DNS cache and close out all the existing persistent connections so the new server is used. It will slow things down unnecessarily in some cases, but verifiable correctness is sort of the point of a forced reload.
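The idea in that comment can be sketched as a standalone model — all class and function names below are invented for illustration and are not actual Necko code. The pool indexes idle connections by host name; on a force reload, a freshly resolved address is compared against the idle connection's peer address, and a mismatch flushes the whole per-host pool so a new connection (and a new DNS lookup) happens.

```cpp
// Hypothetical sketch of the "verify DNS against the peer address"
// idea from comment 4. Not Gecko code; in real Necko the pool lives
// elsewhere and the peer address comes from the socket transport.
#include <cstddef>
#include <map>
#include <string>
#include <vector>

struct IdleConnection {
    std::string host;
    std::string peerAddr;  // address the socket is actually connected to
};

class ConnectionPool {
public:
    void Add(const IdleConnection& c) { mIdle[c.host].push_back(c); }

    // Returns true if an idle connection for |host| was reused.
    // On a force reload, |freshDnsAddr| is a newly resolved address;
    // a mismatch means the cached connections point at a stale server,
    // so the whole per-host pool is discarded and the caller must open
    // a new connection.
    bool TryReuse(const std::string& host, bool forceReload,
                  const std::string& freshDnsAddr) {
        auto it = mIdle.find(host);
        if (it == mIdle.end() || it->second.empty())
            return false;
        if (forceReload && it->second.back().peerAddr != freshDnsAddr) {
            mIdle.erase(it);  // close out all pconns for this host
            return false;
        }
        it->second.pop_back();  // hand the connection to the caller
        return true;
    }

    std::size_t IdleCount(const std::string& host) const {
        auto it = mIdle.find(host);
        return it == mIdle.end() ? 0 : it->second.size();
    }

private:
    std::map<std::string, std::vector<IdleConnection>> mIdle;
};
```

As the comment notes, a mismatch is usually just a legitimate load-balancer rotation, so the cost of this check is an occasional unnecessary reconnect — acceptable when the user has explicitly asked for a forced reload.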
Attached patch patch v1 (obsolete) — Splinter Review
After sleeping on this, I decided to take a more direct approach. Forced reloads (Ctrl+F5, Shift+reload icon, or Ctrl+Shift+R) are very strong signals that we want to resync things from scratch. So this patch responds to a forced reload by closing out the persistent connection pool for the related host; new connections will be opened and new DNS lookups will happen. There is a performance penalty, but more or less that is what you mean when you do a forced reload. Normal reloads are unaffected.

This is a long-standing issue; we don't necessarily have to add the fix to Necko 2.0, imo.
Assignee: nobody → mcmanus
Status: UNCONFIRMED → ASSIGNED
Ever confirmed: true
Attachment #507259 - Flags: review?(bzbarsky)
Comment on attachment 507259 [details] [diff] [review]
patch v1

You don't need the "see bug" bit; the vcs blame will show that. r=me with that line gone.
Attachment #507259 - Flags: review?(bzbarsky) → review+
Attached patch patch v2; r=bz (obsolete) — Splinter Review
Reflects comment 6. Will land post Firefox 4.0 barring other instruction.
Attachment #507259 - Attachment is obsolete: true
Attachment #507441 - Flags: review+
(In reply to comment #6)
> You don't need the "see bug" bit; the vcs blame will show that.

I made the change to the patch, but I wanted to follow up on this. Over 1100 of the 1233 lines of nsHttpTransaction.cpp are uselessly blamed on change d8dc49d5bd609668b3c4fadd6c1df12d5da20547 (rearrange makefiles and directory heirarchy). Anecdotally, that's also true of at least most of the files in netwerk/protocol/http. Changes to whitespace and minor changes in logic (e.g. adding an ||) also obscure blame history. Blame is fragile. The source should be self-describing wrt motivation, but pointers to further reading are a good thing, especially to stable sources like Bugzilla. But hey, I cringe at the 80-column rule too :) I can cope.
The blame mess with necko is a problem, yes; we have people working on hg to fix that. ;) And yes, sometimes you have to follow blame a few changes back in time. But if we added bug pointers for every checkin, we'd right now have several hundred thousand of them in our codebase. That's one every 10 lines of code, on average. That doesn't really scale, unfortunately.
Status: ASSIGNED → RESOLVED
Closed: 14 years ago
Resolution: --- → FIXED
Backed out on suspicion of causing tp4 regression: http://hg.mozilla.org/mozilla-central/rev/b17599b51fd7
Status: RESOLVED → REOPENED
Resolution: FIXED → ---
http://perf.snarkfest.net/compare-talos/breakdown.html?oldTestIds=6864069,6864861,6866327,6867825&newTestIds=6869662,6869775&testName=tp4

This is the source of the tp4 issue. The patch does clear the cache, but I didn't think any of the tests were using external resources?
Also, tp4 shouldn't be doing force-reloads...
Depends on: 650282
Blocks: 648941
Aren't the post-patch results of this tp4 "regression" to be expected? Some of those pages have multiple persistent connections, and the test purposefully resets each of them 10 times. Now, unlike before, they are actually being completely reset, as one would intuitively expect -- see comment 5. Pre-patch, all those connections would have remained open even though it was reloading each of them. I'd vote for fixing the test, either by not having it force-reload between each of the 10 page refreshes ... or just accepting the bad-looking but more accurate results. Ctrl-F5 is supposed to be a full reset, so it should be somewhat slow; otherwise it's just gaming the system.
> and the test purposefully resets each of them 10 times.

Where do you see that happening, exactly?

> Ctrl-F5 is a supposed to be a full reset

tp4 does NOT do that. It just loads web pages the same as you would by typing in the url bar and hitting enter.
(In reply to comment #15)
> > and the test purposefully resets each of them 10 times.
> Where do you see that happening, exactly?
> > Ctrl-F5 is a supposed to be a full reset
> tp4 does NOT do that. It just loads web pages the same as you would by typing in the url bar and hitting enter.

First, Kelley, thanks for taking the time to think about this. I haven't forgotten about it, but it is sitting behind some other higher-priority things.

Boris, I did get as far as logging on the code that is supposed to match up to the Ctrl+F5 path, and it did get hit when running tp4. I did not get a chance to dig into why on earth that was happening; that will be my next step.

And lastly, when I fixed a bug in syn-retry a couple of weeks ago, a bunch of Talos numbers on the Win7 platform improved quite unexpectedly. It stands to reason that this patch, which can result in having to make a bunch of new connections, would have run into that problem disproportionately on that platform. If so, it's possible this can just land now that the other bug is fixed. But that's just conjecture until I have a chance to run it down.
> and it did get hit when running tp4 Er... why? That seems wrong; what did the stack look like?
(In reply to comment #17)
> > and it did get hit when running tp4
> Er... why? That seems wrong; what did the stack look like?

As I said, digging into it is in my queue somewhere. I don't think I even got a stack - just a log line which I didn't expect to be in the output :) I only mentioned it now because this is obviously tying up space in Kelley's brain :)
(In reply to comment #13)
> Also, tp4 shouldn't be doing force-reloads...

LOAD_BYPASS_CACHE is used by force-reload. It turns out that in the case of tp4, it is also used by an XHR POST of some sort. I'll update the patch to use a new LOAD_FRESH_CONNECTION flag and have docshell set LOAD_BYPASS_CACHE | LOAD_FRESH_CONNECTION on the force-reload.
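A minimal model of that flag split follows; the flag values and the ConnectionManager type here are invented for illustration (the real constants live in nsIRequest, and the real pool logic is in Necko's connection manager). The point is that docshell sets both bits on a forced reload, while the XHR POST path sets only the cache-bypass bit, so XHR loads no longer tear down the per-host connection pool.

```cpp
// Invented model of separating "bypass the HTTP cache" from
// "tear down persistent connections". Flag values are hypothetical.
#include <cstdint>
#include <map>
#include <string>

constexpr uint32_t LOAD_BYPASS_CACHE     = 1u << 0;  // hypothetical value
constexpr uint32_t LOAD_FRESH_CONNECTION = 1u << 1;  // hypothetical value

struct ConnectionManager {
    // Count of idle persistent connections, keyed by host.
    std::map<std::string, int> idlePerHost;

    // Called at the start of a transaction for |host|.
    void OnTransaction(const std::string& host, uint32_t loadFlags) {
        // Only the new flag, not plain cache bypass, flushes the pool,
        // forcing new connections and fresh DNS lookups.
        if (loadFlags & LOAD_FRESH_CONNECTION)
            idlePerHost[host] = 0;
    }
};

// What docshell would set on a forced reload (Ctrl+F5):
constexpr uint32_t kForceReloadFlags =
    LOAD_BYPASS_CACHE | LOAD_FRESH_CONNECTION;
// What the XHR POST path sets (cache bypass only, pool untouched):
constexpr uint32_t kXhrPostFlags = LOAD_BYPASS_CACHE;
```

With this split, tp4's XHR POSTs still skip the cache but keep their persistent connections, which is what clears up the measured regression.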
Attached patch patch v3Splinter Review
The tp4 regression (still available on try) clears up with this version. Makes more sense than overloading LOAD_BYPASS_CACHE, imo. The words "captive portal" should be in this bug for search-engine purposes.
Attachment #507441 - Attachment is obsolete: true
Attachment #547506 - Flags: review?(bzbarsky)
Hmm. Would it make more sense for XHR to use BYPASS_LOCAL_CACHE instead of BYPASS_CACHE in the POST case? What it's doing right now is making sure to read from the server but still storing the data in our cache... Jonas?
Comment on attachment 547506 [details] [diff] [review]
patch v3

That said, I think I agree that we may want to have a separate flag for this, because overloading LOAD_BYPASS_CACHE is not great. But should we perhaps leave it out of LOAD_REQUESTMASK? Otherwise each subresource on a page being shift-reloaded will end up closing idle connections, right?
And I'm also a little bit worried about running out of nsIRequest flag bits, but not sure what to do about it yet.
Sorry, don't know enough about the issues here to have an immediate opinion. Most of what I know is based on comments in the XHR code.
The basic issue is whether we want XHR POST to bypass the cache only in terms of not reading from it (but still writing to it) or whether it should skip the cache altogether. But that's fodder for a separate bug anyway....
(In reply to comment #22)
> That said, I think I agree that we may want to have a separate flag for this, because overloading LOAD_BYPASS_CACHE is not great. But should we perhaps leave it out of LOAD_REQUESTMASK? Otherwise each subresource on a page being shift-reloaded will end up closing idle connections, right?

The closing of connections is done on a per-host basis, and the subresources don't have to share the same host, so I think we're doing the right thing. It does mean that every transaction in the load group will get its own connection - but for the extreme case of force-reload I don't think that's wrong. It would be kind of odd to have existing connections to subresources that are out of sync because of a captive portal - but it's not impossible in mobile scenarios.
(In reply to comment #23)
> And I'm also a little bit worried about running out of nsIRequest flag bits, but not sure what to do about it yet.

Perhaps bits 1 through 6 can be reclaimed?
Oh, I missed that 1-6 are unused. OK, good.
Comment on attachment 547506 [details] [diff] [review]
patch v3

r=me
Attachment #547506 - Flags: review?(bzbarsky) → review+
Whiteboard: [inbound]
Status: REOPENED → RESOLVED
Closed: 14 years ago → 13 years ago
Resolution: --- → FIXED
Depends on: 701362