Closed Bug 641937 - Opened 14 years ago - Closed 14 years ago
Improper hostname to IP address handling when multiple IP addresses returned via DNS
Categories: Core :: Networking, defect
Tracking: RESOLVED FIXED (target milestone: mozilla8)
People: Reporter: brad, Assigned: mcmanus
Attachments: 4 files, 4 obsolete files
User-Agent: Mozilla/5.0 (Windows; U; Windows NT 5.1; en-US; rv:1.9.2.7) Gecko/20100713 Firefox/3.6.7 ( .NET CLR 3.5.30729)
Build Identifier: Mozilla/5.0 (Windows; U; Windows NT
If an HTTP server has multiple IP addresses associated with it in DNS, Firefox does not handle it correctly.
If multiple "A" records are returned, Firefox will attempt to connect to the first address in the list. It appears to time out this request after 20 seconds and correctly fall back to the second address returned via DNS. This results in the asset (for example, the initial "outer" HTTP page) being read correctly.
However, when the browser then attempts to read additional assets within the page (style-sheets, scripts, images, etc), it will fall back to the *first* (non-responding) address again.
This time, though, the initial "set" of assets being loaded is subject to a longer timeout (40 seconds).
After this batch finally times out, the assets are again loaded by falling back to the secondary IP address.
The "third" batch of assets is subject to an even longer timeout (60-63 seconds).
So there are a few things going wrong here:
1. After the first address completely fails to respond, Firefox should use, and stay with, the second address - at least for a "reasonable" amount of time.
2. I don't know what the logic is behind the continually increasing timeout, but it makes the problem even worse.
I have tried similar tests with Internet Explorer, and it does not exhibit the problem. For example, upon the initial page load, it will attempt the first IP address in the DNS response. Once this has timed out, all further requests made by the browser to the hostname will be sent to the second address in the DNS list.
I tested and observed this behavior several times. When I subsequently went back and re-tested later, it was able to load pages quickly.
Other thoughts:
I don't know exactly what the "proper" way to handle this is. Some schools of thought are:
1. I would imagine in doing a *single* page load, as I have, that the client should have kept open a persistent HTTP connection, thus reducing the need to initiate more and more connections. There were no instances where *some* of the assets in the second round loaded quickly (because there was a persistent connection open) and some didn't (because more connections needed to be opened.)
2. When receiving multiple IP addresses, choose one at random every time a persistent HTTP connection is opened. This would help balance traffic amongst multiple servers. (Again, this is very arguable, and I don't know if it is "correct".)
3. I don't know why this kept constantly happening, and then all of a sudden started to work correctly. Could it be that there is some point/threshold at which it will "give up" on the failed address and "remember" the other? Perhaps this threshold itself needs to be fixed. Is it after the initial page load? If so, that is not good enough, because multiple assets on a single page could make this take an *INCREDIBLY* long time.
But in any case, if one address completely fails, it should just continue to use others.
Reproducible: Always
Steps to Reproduce:
1. Create DNS records in an authoritative name server for a machine. Make the first "A" record point to a bad machine which is inaccessible (for testing purposes), and make the second "A" record point to a valid, running machine. (A minimal example zone fragment is sketched after these steps.)
2. Attempt to open a web page on the server, using this name. A typical one, maybe with a few dozen images in it - enough so that Firefox will have to open several concurrent connections to read them all - and go back a few times to read them all.
3. Observe asset request and network load times via Firebug, and/or actual requests via Wireshark.
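To illustrate step 1, a minimal BIND-style zone fragment might look like the following. The hostname and addresses are purely hypothetical (192.0.2.0/24 is a documentation range); the first A record points at an unreachable box, the second at the working web server.

  ; hypothetical zone fragment for the failover test
  testhost    IN  A   192.0.2.10   ; dead/unreachable machine, listed first
  testhost    IN  A   192.0.2.20   ; valid, running web server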
Actual Results:
Page takes an INCREDIBLY long time to open. You can see the initial web page after 20 seconds. You can then see some of the images on the page after an additional 40 seconds. You can see even more images after an additional minute, etc.
Expected Results:
The page should take a moderately long time to open. After entering the URL, the page should appear to "hang" for about 20 seconds (until the first server times-out), and then should load (against the secondary server) with normal speed.
Reporter - Comment 1 • 14 years ago
Shows the assets progressively loading. Note that in the first "batch", just the main HTML page is loaded. In the second batch, the JavaScript files are loaded. In the third batch, both images and a Shockwave (Flash) app are loaded, as requested by the aforementioned JavaScript scripts. After the Flash app is loaded, it goes on to load several more images, which make a fourth "batch". Then, video fragments are continually loaded (by the Flash plugin) - these each take long timeouts as well.
Reporter - Comment 2 • 14 years ago
I indicated before that the problem "went away". Well, it magically re-appeared just as quickly as it went away.
I don't know if Firefox *does* remember that the first address is bad - it seems to do so only after the first page is *completely loaded*. Which in this case can take so long that no normal user would ever wait for it - we're talking on the order of 5 or 10 minutes. (They'd hit "refresh", or just give up.)
Perhaps it is closing and re-opening the browser which will make it happen again?
Also, when we tried this on Linux, running:
Mozilla/5.0 (X11; U; Linux x86_64; en-US; rv:1.9.0.12) Gecko/2009072711 CentOS/3.0.12-1.el5.centos Firefox/3.0.12
The problem did not happen - at all. No timeout on initial page load. This tells me that it never even *tried* the first address - like it went right for the second. Maybe there is some round-robin or random logic in there somewhere...
Comment 3 • 14 years ago
> that the client should have kept open a persistent HTTP connection,
We do use persistent connections in general (though note that we often kick off subrequests before the page load itself has finished, so they can't go on the same connection as the page itself).
> When receiving multiple IP addresses, choose one at random
This would actually break load-balancers which return IPs in round-robin order; a pretty common thing to do. I'm pretty sure we look at the IPs in order for that reason (though who knows what the ancient 1.9.0 builds you tried on Linux are doing!).
From a brief skim of our code the only place we "remember" that an IP failed is in nsDNSRecord (or more precisely in the nsHostRecord it's holding); see the GetNextAddr call in nsSocketTransport::RecoverFromError.
nsHostRecords are cached; the cache size is capped at 400 hostnames by default but is controlled by the "network.dnsCacheEntries" preference. Just to check, do you have that preference set to a non-default value?
Reporter - Comment 4 • 14 years ago
> This would actually break load-balancers which return IPs in round-robin order;
> a pretty common thing to do. I'm pretty sure we look at the IPs in order for
> that reason (though who knows what the ancient 1.9.0 builds you tried on Linux
> are doing!).
Yes, I believe the Linux ones are probably doing something different. Note that Internet Explorer seems to do the same, though. I don't know that this would necessarily *break* a round-robin DNS load balancer - but would perhaps add another level of entropy over it.
> Just to check, do you have that preference set to a non-default value?
I have made no modifications to Firefox's preferences/settings whatsoever. Also, the entire extent of my testing was done within a very small time period. i.e. I would not have overflowed a 400 hostname limit between my failed and successful tests.
> From a brief skim of our code the only place we "remember" that an IP failed is in nsDNSRecord (or more precisely in the nsHostRecord it's holding); see the GetNextAddr call in nsSocketTransport::RecoverFromError.
So it *appears* (and I emphasize this because I'm not sure) - that this is "remembered" only after the entire page load completes. (i.e. the entire page - including scripts, images, etc). If you abort it because it took ten minutes - or refresh in the middle - it does not get remembered. I know I mentioned this earlier, but further testing seemed to back up this assumption.
Comment 5 • 14 years ago
> I have made no modifications to Firefox's preferences/settings whatsoever.
OK, but could you still check the value for me? Open up about:config and type the preference name in the filter field. Does anything show up?
> that this is "remembered" only after the entire page load completes.
The code I referenced moves to the next IP in the nsHostRecord as soon as we hit an error and fail over; the act of moving to the next IP remembers that the first one is bogus (it's a stateful iterator)...
One other thing worth checking. On the machine where you can reproduce the problem, could you reproduce it while logging what our DNS subsystem thinks it's doing? https://developer.mozilla.org/En/HTTP_Logging has the directions, but you want to set NSPR_LOG_MODULES to "nsHostResolver:5" (without the quotes). Then attach the log to this bug.
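For reference, on Windows that boils down to setting a couple of NSPR environment variables before starting Firefox from the same console (run from the Firefox install directory or give the full path; the log file location here is just an example):

  set NSPR_LOG_MODULES=nsHostResolver:5
  set NSPR_LOG_FILE=C:\temp\dns-log.txt
  firefox.exe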
Reporter - Comment 6 • 14 years ago
> OK, but could you still check the ["network.dnsCacheEntries"] value for me?
I see no "network.dnsCacheEntries" setting. The only DNS entries I see are "network.dns.disableIPv6", "network.dns.ipv4OnlyDomains", and "network.proxy.socks_remote_dns".
Reporter - Comment 7 • 14 years ago
As requested.
Comment 8 • 14 years ago
And the relevant host is badbalancer.seadn.info, right?
That's coming from the cache the whole time there, so we shouldn't be falling back to the broken IP again...
Patrick, do you want to take a look at this?
Status: UNCONFIRMED → NEW
Ever confirmed: true
Reporter - Comment 9 • 14 years ago
This is a Firebug network trace which was taken for the same session, and will correlate to the previous Firefox log (attachment #519633).
Reporter - Comment 10 • 14 years ago
Note that after the last test, once the page had completed (and after I snapshotted the log), I tried to refresh the page, and the problem still persisted. (So much for the theory of the state being remembered after the page successfully, completely loads.)
Reporter - Comment 11 • 14 years ago
(In reply to comment #8)
> And the relevant host is badbalancer.seadn.info, right?
Correct.
> That's coming from the cache the whole time there, so we shouldn't be falling
> back to the broken IP again...
Yes, but we are.
Assignee - Comment 12 • 14 years ago
(In reply to comment #8)
>
> Patrick, do you want to take a look at this?
yep. It may have to sit in my queue for a few days though..
Assignee: nobody → mcmanus
Reporter - Comment 13 • 14 years ago
FYI - Chrome appears to choose a DNS entry (from the ones given) at random. When you refresh it picks another. If/when it determines one is dead, it will not go back to it.
Assignee - Comment 14 • 14 years ago
(In reply to comment #7)
> Created attachment 519633 [details]
> Reproduce problem - Firefox Log file with NSPR_LOG_MODULES=nsHostResolver:5
>
> As requested.
Hi Brad, what role does [host=proxy.seadn.info] play in your STR?
Thanks.
Reporter - Comment 15 • 14 years ago
The main web page that is opened here has many <a href="xx"> links on it to other pages. Several of them are to proxy.seadn.info, which resolves to a valid server here, in house. (It is going through a GeoDNS setup, so it may or may not resolve to someplace valid for you.) The same goes for several other hosts I am seeing in that log file, such as "hydra", "umg1", "umg2", and "umg3".
I don't know why Firefox would even be resolving those names, though, as they are just <a> links in the page source - i.e. I have not clicked on them - and it's not like any content from the web page is actually being loaded from them (i.e. images, scripts, style-sheets, etc). Does Firefox try to pre-load this type of stuff?
"badbalancer.seadn.info" should be the only server really involved here (as shown in the Firebug output). That's where the main page, stylesheets, scripts, and images are all pulled from.
Comment 16 • 14 years ago
> I don't know why Firefox would be even resolving those names though
See https://developer.mozilla.org/En/Controlling_DNS_prefetching
Assignee - Comment 17 • 14 years ago
(In reply to comment #15)
> The main web page that is opened here has many <a href="xx"> links on it to
> other pages. Several of them are to proxy.seadn.info, which resolves to a valid
ok, thanks.. I just wanted to make sure you weren't proxying the http out of firefox.
> I don't know why Firefox would be even resolving those names though,
as boris says, that's just dns prefetch at work, so you don't have to wait when you do click on them.
Assignee - Comment 18 • 14 years ago
(In reply to comment #3)
>
> From a brief skim of our code the only place we "remember" that an IP failed is
> in nsDNSRecord (or more precisely in the nsHostRecord it's holding); see the
> GetNextAddr call in nsSocketTransport::RecoverFromError.
It seems that the only place we do remember this state is in the nsDNSRecord, not the nsHostRecord (specifically mIter in nsDNSRecord), and there is a new nsDNSRecord for every lookup, including cache hits.. it's generally created in OnLookupComplete.
I've created an RR where the first A record is reachable and the second is not, and I never see necko try to connect to the bad one.
This has the advantage that all our connections to the same host will resolve to the same address (at least as long as the host record doesn't change). And that stickiness is widely viewed as a good thing for server applications - they can get their load balancing at a granularity of different browsers. As long as we're doing that I don't want to lose that property.
OTOH it means that every lookup has to relearn independently which members of the nsHostRecord are unreachable.. which is probably the source of this bug.
I'm not sure that simply calling GetNextAddr ought to be the cue to mark something as universally bad, though - just because something cannot be connected to on one port (and thus triggers GetNextAddr()) doesn't mean all ports associated with the IP are useless. But then again, a full SRV map and reporting interface seems like a pita for little practical benefit.. I can certainly implement the GetNextAddr trigger, which will be an improvement, and we can see if it's good enough.
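Roughly, the split looks something like this; an illustrative sketch only, not the real Necko classes:

  // illustrative sketch only - not the actual nsHostRecord/nsDNSRecord code
  #include <string>
  #include <vector>

  struct HostRecord {                    // shared and cached per hostname
    std::vector<std::string> addrs;      // addresses from the A records
  };

  struct DNSRecord {                     // a fresh one is created per lookup,
    HostRecord* host;                    // even on cache hits
    size_t iter = 0;                     // per-lookup cursor: failover progress
                                         // lives here and dies with this object
    const std::string& GetNextAddr() {
      return host->addrs[iter++ % host->addrs.size()];
    }
  };

So each new DNSRecord starts its cursor back at the first (dead) address, which matches the behavior Brad is seeing.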
boris - thoughts?
Reporter - Comment 19 • 14 years ago
So - I believe the precise reason for allowing multiple DNS entries is in fact to allow for failover in the event that an entry fails. (The search order in this list is arguable, I think, but that isn't the main point.)
So the GetNextAddr fix seems like a good idea - when it is called, "forget" the "first" entry in the DNS cache [list].
I would also argue (though this, I am sure, would be much more debatable) that even after the DNS entry expires in cache and is refreshed, the "original/bad" address should not be used - or should be subject to a quicker timeout. (I know this is probably making you cringe.) My thought is that if I were to set my DNS timeout to "X" seconds, that would mean I would be subject to a twenty-second failover time every "X" seconds. IE seems to time out much quicker, and remembers it after it does.
I liked the SRV map idea - but as it is so frequently unused, we dropped the idea.
BTW - we are actually developing server systems - and are sort of taking the opposite position of what you are saying - I think - in that we see that it is the *client* which needs to take a bit more of an active role in failover. Firefox is actually the only browser we have tested that didn't do what we thought it should / what we would have liked.
Just my 2-cents...
Assignee - Comment 20 • 14 years ago
(In reply to comment #18)
> (In reply to comment #3)
> I can certainly implement the getnextaddr trigger which will be an improvement
> and we can see if it's good enough.
On second thought: meh. nsDNSRecord is too much like an iterator to hook GetNextAddr without surprising people with the side effect.
Barring objection, I'll add a new method to the interface to report the last address handed out as unreachable.. and make that sticky on the underlying host record. We'll still return those addresses if all the addresses in the record have been reported bad. Extra credit if this info can persist across successful revalidations of the host record (which would be easy if we actually had a concept of revalidation, but we don't have access to the real RR - we can fake it with some comparisons, though).
Reporting something as unreachable only needs to be called from two places afaik (socket transport and the SOCKS IO layer.. websockets uses this for a different reason, which is probably a good reason not to overload GetNextAddr()).
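In sketch form the idea is roughly this (simplified; the eventual patch adds a reportUnusable method to nsIDNSRecord and keeps the blacklist on the shared host record):

  // sketch of the proposal, not the actual patch
  #include <cstdint>
  #include <string>
  #include <vector>

  struct HostRecord {
    std::vector<std::string> addrs;
    std::vector<std::string> blacklist;  // sticky for the lifetime of the cache entry
  };

  struct DNSRecord {
    HostRecord* host;
    std::string lastAddr;                // whatever GetNextAddr() handed out last

    // called by socket transport / SOCKS layer when a connect to lastAddr fails
    void ReportUnusable(uint16_t /*port*/) {
      host->blacklist.push_back(lastAddr);
    }
  };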
Comment 21 • 14 years ago
> I'll add a new method to the interface to report the last address out as
> unreachable.. and make that sticky on the underlying host record.
Sounds good to me. Good catch on mIter storing all its state internally, not in the nsHostRecord.
Assignee - Comment 22 • 14 years ago
OK, here's a patch. I've sent it off to Try for a sanity test.. when it comes back green there, Brad, I will ask you to verify it with a build from the Try server, and once we've got that nailed down I'll ask Boris to review.
Your initial set of connects will still be dreadfully slow - we should time them out faster, but that's a different, non-DNS bug. But once things start working they ought to stay that way, once all the dead addresses have been identified.
The patch ends up keeping a list of blacklisted addresses, represented as strings, associated with the host record. These are then filtered out of the GetNextAddr() responses as long as there are >0 non-filtered responses available; otherwise the filter gets reset.
This has some advantages: there are no timeouts to fiddle with. Given that none of the addresses is inherently better than the others, we can just roll with the failure list for a long time (or until they get pushed out of the DNS cache, which is more likely)... the list is independent of DNS refreshes, IPv4/IPv6, etc., which is definitely good. Perhaps as importantly, it pushes all the cost down to the cases where failures have been reported - the common case is basically untouched.
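The filtering described above, in sketch form (assumed behavior, not the literal patch): blacklisted addresses are skipped, and if every address has been blacklisted the filter is reset rather than returning nothing.

  // sketch of the GetNextAddr() filtering, not the literal patch
  #include <algorithm>
  #include <string>
  #include <vector>

  static bool IsBlacklisted(const std::vector<std::string>& blacklist,
                            const std::string& addr) {
    return std::find(blacklist.begin(), blacklist.end(), addr) != blacklist.end();
  }

  static std::string GetNextAddr(const std::vector<std::string>& addrs,
                                 std::vector<std::string>& blacklist,
                                 size_t& iter) {
    bool anyUsable = false;
    for (const auto& a : addrs)
      if (!IsBlacklisted(blacklist, a)) { anyUsable = true; break; }
    if (!anyUsable)
      blacklist.clear();                 // everything failed: reset the filter

    for (size_t tries = 0; tries < addrs.size(); ++tries) {
      const std::string& candidate = addrs[iter++ % addrs.size()];
      if (!IsBlacklisted(blacklist, candidate))
        return candidate;
    }
    return std::string();                // only reached if addrs is empty
  }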
Assignee - Comment 23 • 14 years ago
Forgot to bump the UUID of the IDL.
Attachment #520972 -
Attachment is obsolete: true
Reporter - Comment 24 • 14 years ago
Sounds great! Let me know when, and I will test.
Thank you!
-BKG
Assignee - Comment 25 • 14 years ago
(In reply to comment #24)
> Sounds great! Let me know when, and I will test.
>
You filed the bug under platform=windows-xp, so I made a windows build:
http://stage.mozilla.org/pub/mozilla.org/firefox/tryserver-builds/mcmanus@ducksong.com-3a2919754a10/try-w32/firefox-4.0b13pre.en-US.win32.zip
Let me know if it helps, like I said - the initial connections may still take a long time to timeout (depending on exactly why they fail).
Assignee - Comment 26 • 14 years ago
(In reply to comment #24)
> Sounds great! Let me know when, and I will test.
>
any feedback here?
Reporter - Comment 27 • 14 years ago
D'oh! Sorry - I did the test and got distracted afterwards - thought that I had responded! (Still half think I did?) Anyways...
The build worked great! I got an initial 20-second timeout (as expected); afterward the entire page loaded quickly. Subsequent loads of pages from that site were immediate. After exiting Firefox and restarting, I was subject to a single, first-time 20-second hit again before everything worked smoothly.
So yes - very good work. Thank you!
Assignee - Updated • 14 years ago
Attachment #520975 -
Flags: review?(bzbarsky)
Assignee - Comment 28 • 14 years ago
Fix the bitrot introduced by 280661 and 645263.. it would be great to get this reviewed and waiting on the train early.
Attachment #520975 -
Attachment is obsolete: true
Attachment #520975 -
Flags: review?(bzbarsky)
Attachment #526490 -
Flags: review?(bzbarsky)
Comment 29 • 14 years ago
Comment on attachment 526490 [details] [diff] [review]
skip-bad-portions-of-dns-rr v3
Please document the mLastIter member.
In nsDNSRecord::ReportUnusable, this line:
>+ if (mIter && mHostRecord) {
Makes no sense: we have already dereferenced mHostRecord. Furthermore, mHostRecord should not be null in general. Just remove that null-check, unless there's something deep I'm missing here.
Also, mIter in ReportUnusable will be null if mLastIter points to the last enumerable value. But we do want to blacklist that value in that situation. So why the null-check on mIter there?
On the other hand, you do want to check the generation count here, since the mLastIter could just be completely invalid now, right?
>+ nsCOMPtr<nsDNSRecord> record = do_QueryInterface(aRecord, &rv);
That QI will always succeed, since nsDNSRecord is a subclass of nsIDNSRecord and does not define a separate IID.
In other words, that line is equivalent to a static_cast. If that's safe, just do one. If not, another solution is needed.
Why not just change nsIDNSRecord instead of adding this "go via the service" back door?
>+ nsCAutoString strQuery(buf);
nsDependentCString ?
Why not just make mBlacklistedItems an nsTArray<nsCString>* and avoid the extra heap allocations of nsCString?
For that matter, an empty nsTArray is one word, so there's no reason to use a pointer to an nsTArray here.
The rest looks fine, I think.... And it's still early in the cycle (though I suspect you meant the previous cycle in your comments. :( ).
Attachment #526490 -
Flags: review?(bzbarsky) → review-
Comment 30 • 14 years ago
> I got an initial 20 second timeout (as expected), afterward the
> entire page loaded quickly.
Sounds like a separate bug, but should we be more aggressive than the 20-second timeout, i.e. try the 2nd entry after some <20 sec interval has passed?
Comment 31 • 14 years ago
As I understand it, Chrome might do something like that. Agreed that it's a separate bug.
Assignee - Comment 32 • 14 years ago
My apologies for the latency on getting back to this; I know that makes it harder for everyone.
most of the review was simply applied, thanks for the help!
(In reply to comment #29)
> >+ if (mIter && mHostRecord) {
> [..]
> So why the null-check on mIter there?
It should have been (and now is) a check for mHostRecord->addr_info, to make sure we were dealing with a list.
> Why not just make mBlacklistedItems [..]
> For that matter, an empty nsTArray is one word, so there's no reason to use
> a pointer to an nsTArray here.
mBlacklistedItems remains a pointer instead of an object because nsHostRecord is allocated in a 'special' way. Instead of using the nsHostRecord ctor, it explicitly calls ::operator new(size), where size is sizeof(nsHostRecord) + the length of the hostname. The hostname is stored at the end of the nsHostRecord object. As a result, none of the members of nsHostRecord have their ctors run.. nsHostRecord doesn't have any class objects as members. I don't really know a way around that, assuming we keep the all-in-one allocation.. maybe there is a way?
If I expected ReportUnusable() to be a common occurrence, I would just make nsHostRecord use a proper ctor with a pointer to the hostname (trade an alloc for an alloc), but most of the time mBlacklistedItems will just stay null for the duration of the object.
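For context, the allocation pattern described above looks roughly like this (a simplified sketch, not the real nsHostRecord):

  // sketch of the all-in-one allocation: no constructors run, so members that
  // would need one (arrays, mutexes) are kept as pointers and set up by hand
  #include <cstring>
  #include <new>
  #include <string>
  #include <vector>

  struct HostRecord {
    std::vector<std::string>* blacklist; // pointer: stays null until first use
    char host[1];                        // hostname stored inline at the end

    static HostRecord* Create(const char* hostname) {
      size_t len = std::strlen(hostname);
      void* mem = ::operator new(sizeof(HostRecord) + len);
      HostRecord* rec = static_cast<HostRecord*>(mem);
      rec->blacklist = nullptr;          // members must be initialized manually
      std::memcpy(rec->host, hostname, len + 1);
      return rec;
    }
  };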
Attachment #526490 -
Attachment is obsolete: true
Attachment #547077 -
Flags: review?(bzbarsky)
Comment 33 • 14 years ago
Comment on attachment 547077 [details] [diff] [review]
skip-bad-portions-of-dns-rr-v4
>+nsresult
>+nsDNSRecord::ReportUnusable(PRUint16 aPort)
That needs to be NS_IMETHODIMP for this to compile on Windows.
>+ // Check that wer are using a real addr_info (as opposed to a single
s/wer/we/
>+++ b/netwerk/dns/nsIDNSRecord.idl
>+ * This function indcates that the last address obtained via getNextAddr*()
s/indcates/indicates/
As far as nsHostRecord memory management...
There are two main sane options there.
One is to implement a custom operator new on nsHostRecord that takes the hostLen or the nsHostKey or whatever as an argument and returns an allocation of the appropriate size, and then make nsHostRecord::Create just call new nsHostRecord. You can look at nsCSSCompressedDataBlock (in nsCSSDataBlock.h) as an example of such an operator new.
Another is to keep things as they are but explicitly construct an nsHostRecord at the memory location you just allocated using placement new in nsHostRecord::Create. You can look at nsCookie::Create for an example of this pattern.
I'd really prefer that you do one or the other, which will allow you to make the blacklist an nsTArray instead of an nsTArray*. In either case, you should probably make the nsHostRecord constructor private so the only way to allocate one is via nsHostRecord::Create...
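For reference, the placement-new flavor might look roughly like this (sketch only, reusing the simplified layout sketched above, not the actual patch):

  // sketch of option 2: same oversized allocation, but run the constructor in
  // place so members like the blacklist array are properly constructed
  #include <cstring>
  #include <new>
  #include <string>
  #include <vector>

  struct HostRecord {
    std::vector<std::string> blacklist;  // now a real member; its ctor runs
    char host[1];

    static HostRecord* Create(const char* hostname) {
      size_t len = std::strlen(hostname);
      void* mem = ::operator new(sizeof(HostRecord) + len);
      HostRecord* rec = new (mem) HostRecord(); // placement new runs the ctors
      std::memcpy(rec->host, hostname, len + 1);
      return rec;
    }

    static void Destroy(HostRecord* rec) {
      rec->~HostRecord();                // run destructors explicitly...
      ::operator delete(rec);            // ...then free the raw block
    }

   private:
    HostRecord() : host{0} {}            // private: allocate only via Create()
  };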
Also, I don't think you need the ResetBlacklist() call in ~nsHostRecord now that the array destructor will do the right thing.
r=me with the above dealt with.
Oh, and the checkin comment should probably be more along the lines of "Blacklist non-responding IP addresses for a hostname so the next time we access that hostname we don't have to wait for a timeout again" or some such, describing what the patch actually does as opposed to what the bug summary is.
Attachment #547077 -
Flags: review?(bzbarsky) → review+
Assignee - Comment 34 • 14 years ago
This is the patch which is at Try, and which I hope to push if that goes well. FYI.
I used placement new, which I had never used in conjunction with delete before. Funky!
In addition to mBlacklistedItems, which I added, there was also a mutex defined as a pointer for the same reason, so it also became a member object.
Attachment #547077 -
Attachment is obsolete: true
Attachment #547224 -
Flags: review+
Comment 35 • 14 years ago
Patch 5 looks good. :)
Assignee - Updated • 14 years ago
Whiteboard: [inbound]
Comment 36 • 14 years ago
Status: NEW → RESOLVED
Closed: 14 years ago
Resolution: --- → FIXED
Whiteboard: [inbound]
Target Milestone: --- → mozilla8