Closed
Bug 70213
Opened 24 years ago
Closed 21 years ago
XP_UNIX DNS lookups are serialized
Categories
(Core :: Networking, defect, P2)
Tracking
()
VERIFIED
FIXED
mozilla1.6alpha
People
(Reporter: darin.moz, Assigned: darin.moz)
References
Details
(Keywords: perf)
Attachments
(2 files, 2 obsolete files)
7.17 KB,
patch
|
Details | Diff | Splinter Review | |
11.16 KB,
patch
|
Details | Diff | Splinter Review |
XP_UNIX DNS lookups are serialized. Allowing only one DNS lookup to be performed at a time has serious drawbacks as a "slow lookup" would noticably block all other lookups, leading to poor perceived performance. The current implementation results from the fact that NSPR 4.0 does not export a re-entrant gethostbyname system call. An upgrade to NSPR 4.1 would solve this problem, allowing for a thread pool based DNS lookup scheme. This bug is a follow up to bug 10733.
Since NSPR 4.2 has landed, is this now fixed, or is the "rework" noted in 10733 for after NSPR 4.1's async DNS still required for this to be really and truly fixed the right way?
Assignee | ||
Comment 4•23 years ago
|
||
the NSPR 4.2 landing does not fix this, but it does provide us with a way to fix this. the solution on our end would require spawning multiple DNS request threads (on which we'd call the reentrant PR_GetIPNodeByName).
Comment 5•23 years ago
|
||
This should probably be prioritized up some. I know several people, including myself, who can't use Mozilla effectively because it tends to "lock up" while loading pages from sites that reference ad.doubleclick.net and other ad-banner servers. One slow/dead DNS server and Mozilla can sit there for five minutes before it finally times out and allows one to browse again. This problem is keeping me from using Mozilla as my primary browser. I realize there's a lot of other work to do, but parallelizing DNS lookups would greatly increase performance and perceived stability. -John
Assignee | ||
Comment 6•23 years ago
|
||
good point, but i'm still not sure that this will make it for mozilla 1.0
Comment 7•23 years ago
|
||
I think browser hang bugs are a pretty big deal, so I'm nominating it for 1.0. We've hit this when doing LDAP lookups for mailnews, and I'm sure it gets hit in other places as well.
Keywords: mozilla1.0
Comment 8•22 years ago
|
||
*** Bug 137444 has been marked as a duplicate of this bug. ***
Comment 9•22 years ago
|
||
Bug 106031 and bug 94793 might both also be dupes, are at least commented as possible ones... (could someone please look after them?)
Comment 10•22 years ago
|
||
*** Bug 134687 has been marked as a duplicate of this bug. ***
Comment 11•22 years ago
|
||
*** Bug 106031 has been marked as a duplicate of this bug. ***
Comment 12•22 years ago
|
||
*** Bug 94793 has been marked as a duplicate of this bug. ***
Comment 13•22 years ago
|
||
Here's a patch that implements asynchronous DNS lookups for UNIX that uses the technique described in comment #4. It works for me, but I'm not much of an expert on threads so there may be some race issues or whatever. Note that this only works on platforms where gethostbyname is reentrant. On other platforms, it will still fire up multiple threads, but only one thread at a time will enter the critical section in PR_GetIPNodeByName which defeats the whole point. So this really only works on platforms which define _PR_HAVE_GETHOST_R or _PR_HAVE_THREADSAFE_GETHOST. All the code for this change is within the following define: #if defined(XP_UNIX) && defined(ASYNC_DNS) I just stuck an '#define ASYNC_DNS 1' at the top of nsDnsService.h as I'm not really familiar with how to integrate an option like this into the build system.
Assignee | ||
Comment 14•22 years ago
|
||
peter: thank you for the patch. i unfortunately do not have time right now to review this patch as carefully as i should. it has been on my list of things to do since you posted it, but i've just been too busy with mozilla 1.0 bugs. please don't be discouraged by the lack of response to your patch... we really appreciate your contribution. right now, however, it's just not top priority, and there are a number of more critical bugs stealing everyones attention.
Comment 15•22 years ago
|
||
*** Bug 170694 has been marked as a duplicate of this bug. ***
Comment 16•22 years ago
|
||
[RFE] is deprecated in favor of severity: enhancement. They have the same meaning.
Severity: normal → enhancement
Comment 17•22 years ago
|
||
This isn't just a small performance issue that could be tweaked, it could be a very big problem when you have one unresolvable name that you try to resolve every time. I wrote about this side-effect in Bug 170694. When the only nameserver responsible for my resolving hostname becomes unreachable, mozilla adds a lookup for the name on every page load and startup. This means startup and every page load is delayed from teh start by a timeout (on the order of 20 seconds) thanks to the the first NS request -- for my hostname -- blocking.
Comment 18•22 years ago
|
||
Changing Severity back to normal per comment #17.
Severity: enhancement → normal
Updated•22 years ago
|
Summary: [RFE] XP_UNIX DNS lookups are serialized → XP_UNIX DNS lookups are serialized
Comment 19•22 years ago
|
||
I'll take this. It's DNS and I've got fewer bugs than darin...at the moment :-)
Assignee: darin → gordon
Priority: -- → P2
Comment 20•22 years ago
|
||
*** Bug 192579 has been marked as a duplicate of this bug. ***
Comment 21•22 years ago
|
||
Do we have a revised milestone on this? This for me really needs to be fixed soon and is a major blocker for us web developers.
Comment 22•22 years ago
|
||
I'd like to get this fixed for 1.4.
Target Milestone: Future → mozilla1.4alpha
Comment 23•22 years ago
|
||
This bug is driving me insane. I used to notice this back around milestone 13 on FreeBSD 3.X. Somewhere in there it disappeared for about a year on my BSD machines.. now with FreBSD 5.0 and mozilla 1.3b it's back. Sites such as cnn.com or nhl.com can take on the order of minutes to render. Usually with "resolving ad.doubleclick.net" in the status bar. Some other sites have problems with resolving other DNS records. If it helps any, I noticed that under FreeBSD 5.0 it can take 20 seconds to resolve each hostname. The A record pops up right away, but the MX record takes 20 seconds (even if there is no MX record the resolver will hang for 20 seconds).
Comment 24•22 years ago
|
||
Well, I temporarily solved my problem by turning off IPv6.. which is related to another bug with FreeBSD/mozilla that results in "connection refused" messages (that bug is already filed). My ISP's nameservers were ignoring IPv6 (AAAA) queries. But I'd still prefer if all of my tabs didn't hang when the browser is stuck resolving one address.
Comment 25•22 years ago
|
||
gordon: is this going to be hard to fix? if you cannot fix this for 1.4, can you think about fixing the bugs that get jammed up because we are serialized?
Comment 26•21 years ago
|
||
The last day or two have really brought home why this is sorely needed, since ad.doubleclick.net has been timing out for me, causing many pages to take over a minute to render due to the long delays while the lookups time out! A candidate patch was posted almost a year ago, can't this be moved higher up the list? This is alpha time, its unlikely a patch like this would be approved during beta time so otherwise its at least three more months before this gets fixed. There are a pretty fair number of votes for a Unix-only bug, I'm not the only one who has been waiting forever on this (no pun intended...)
Comment 27•21 years ago
|
||
Unfortunately that patch that I did has some problems. I didn't really understand what I was doing and the patch as written causes segfaults. I did fix those, but at the cost of memory leaks so never stuck that patch in there. Besides, I think the fix is sort of wrong anyway. Instead of threads I think it would be better to use async DNS or something similar, but I think that would be a lot more work. Either way, I have no time to try and fix this problem. I agree with a lot of what other people have said. This is a major bug as far as I'm concerned. It really wrecks how well Mozilla works on platforms with IPv6 enabled.
Comment 28•21 years ago
|
||
This bug has caused me to restart Mozilla countless times over the _years_. Obviously, votes aren't working, and I do not have the skills to fix this. I will contribute a dollar for every vote for this bug up to $100 USD to the person or persons who are major parts of the effort to close this bug, when this bug is closed. Perhaps we need a "money" keyword?
Comment 29•21 years ago
|
||
it's a particularly big bug because mozilla would _feel_ much faster with parallel NS lookups, ignoring for the moment the dealbreaker problems it would also solve. maybe most people think the time hit isn't too big, but consider some sites where there are dozens or hundreds of images inlined from different sources... it adds up. though it looks like someone has bumped the target to the next release (1.4a), so ben ya might only be out about 40 bucks..
Comment 30•21 years ago
|
||
This one is driving me nuts as well. If this is still valid for the new Mozilla Firebird architecture, I'll throw another 20$ in for this being solved...
Comment 31•21 years ago
|
||
Ok, I'm going to take a shot at implementing this, as it has been a PITA for me since forever. This will be my first shot at hacking Moz, so we'll see how it goes. As far as I can tell the code involved is low-level enough that the Firebird change won't affect it at all (someone please tell me if I'm wrong on this count before I waste a lot of time) Wish me luck...
Comment 32•21 years ago
|
||
This should do as a starting point, at least. It's based heavily on the previous patch by Peter Haight; the only major difference is that I've eliminated the "main" DNS thread as it's no longer necessary. Each DNS lookup now runs in its own thread and thus in parallel without one blocking the other. At compile time the changes depend upon: #if defined(XP_UNIX) && defined(UNIX_ASYNC_DNS) UNIX_ASYNC_DNS is defined by the configure script on *all* Unix builds. This needs refining as there are versions without a reentrant gethostbyname() call. (FreeBSD is one, I think) I've been using it for a few hours and no problems to report yet (the only thing I noticed is that the app will hang on shutdown until any pending lookups are complete, but it does that without the patch anyways). More testing is needed, obviously, but I'd like to get some more eyes on this as well, so here it is. Finally, if anyone wants to test this but can't or won't make a complete build, I can provide a copy of my libnecko.so (Linux/Intel, gcc 2.95.3, glibc 2.2.3).
Comment 33•21 years ago
|
||
This should do as a starting point, at least. It's based heavily on the previous patch by Peter Haight; the only major difference is that I've eliminated the "main" DNS thread as it's no longer necessary. Each DNS lookup now runs in its own thread and thus in parallel without one blocking the other. At compile time the changes depend upon: #if defined(XP_UNIX) && defined(UNIX_ASYNC_DNS) UNIX_ASYNC_DNS is defined by the configure script on *all* Unix builds. This needs refining as there are versions without a reentrant gethostbyname() call. (FreeBSD is one, I think) I've been using it for a few hours and no problems to report yet (the only thing I noticed is that the app will hang on shutdown until any pending lookups are complete, but it does that without the patch anyways). More testing is needed, obviously, but I'd like to get some more eyes on this as well, so here it is. Finally, if anyone wants to test this but can't or won't make a complete build, I can provide a copy of my libnecko.so (Linux/Intel, gcc 2.95.3, glibc 2.2.3).
Comment 34•21 years ago
|
||
OpenVMS doesn't have reentrant DNS routines, so I need to remove UNIX_ASYNC_DNS. See bug 208457.
Comment 35•21 years ago
|
||
Hmmm, major leakage in my first patch. Update coming soon. (sorry for the double comment, by the way)
Comment 36•21 years ago
|
||
Correct. FreeBSD does _NOT_ have a gethostbyname_r(). The symbol in libc_r is merely the normal one. This is the case from 2.2.x up to and including 4.x. I am almost sure 5.x is still not reentrant as well.
Comment 37•21 years ago
|
||
AFAIK, Mac OS X doesn't support gethostbyname_r() either. Bummer.
Comment 38•21 years ago
|
||
That's ok, Macs have something better: OTInetStringToAddress(), a proper asynchronous DNS call. Don't know what part of the system it's implemented in, but Moz doesn't use anything like gethostbyname() on XP_MAC. Ummm, OS X *is* compiled with XP_MAC and not XP_UNIX right?
Comment 39•21 years ago
|
||
nope :) defined XP_UNIX && defined XP_MACOSX
Comment 40•21 years ago
|
||
For systems without a re-entrant gethostbyname(), how about using getaddrinfo()? FreeBSD 4.x has it.
Assignee | ||
Comment 41•21 years ago
|
||
i spoke with wtc yesterday concerning this bug (and others related to it), and we came up with a way to expose getaddrinfo via NSPR. a good number of platforms (including WinXP) support getaddrinfo. OSX also supports it. moreover, we spoke about using detachable threads for the DNS queries. that is going to be required in order to avoid shutdown hangs. however, it means that we could get into trouble if the primordial thread ever calls PR_Cleanup (which it currently does not). timeless: i seem to recall seeing a patch about adding PR_Cleanup to XPCOM shutdown. we might need to WONTFIX that bug. steve: thanks for the patch. at this point i'm strongly considering a full rewrite of the DNS service; however, that may be too ambitious, in which case something like what you've put together may be exactly the start of what we need. bug 205726 has many details about the changes i want to make to the DNS service.
Depends on: 205726
Comment 42•21 years ago
|
||
the windows dns service can hang if you're unlucky enough to use a debugger near shutdown. It's a clear case of mixing two incompatible message passing systems (windows, nspr) - the windows message can be lost while running under the debugger, and then nspr waits expectfully for the thread to finish while the thread continues to wait for the already lost message. believe me when i say i'll be unhappy if we wontfix the pr_cleanup bug. i'm still unclear about the precise problem which caused gordon to suggest unjoinable threads. i'm also unsure as to why it would be bad to call pr_cleanup with those unjoinable threads running around. we're quiting, all data is gone. what's the worst thing that happens? we crash?
Assignee | ||
Comment 43•21 years ago
|
||
timeless: in order to support IPv6 under windows XP, we must call getaddrinfo. in fact, microsoft suggests that applications use getaddrinfo in place of any other mechanism. that means, we will be doing away with the existing WINSOCK based DNS solution. moreover, the idea is to have one implementation for all platforms. on platforms which support getaddrinfo (or at least one of the other re-entrant forms of gethostbyname) we will spawn several threads so that a slow to resolve DNS query does not completely block the browser from visiting other sites. that is the problem we have today with the current XP_UNIX impl, and indeed it is what this bug is about. also, we hit the same problem at shutdown when we attempt to join with the DNS thread. if that DNS thread has not returned from gethostbyname (or equivalent function), then the UI thread will hang waiting for the join operation to complete. this hang can last for a long time (minutes!)... such that users usually think moz has died. the proposed solution is to use detached threads. each will own a reference to the DNS service so they can do what they need to do after "gethostbyname" returns, but the UI thread will not need to join with these threads before itself going away. the problem of course is that NSPR has many globally allocated resources that may be in use by these worker threads. we may certainly crash if the UI thread calls PR_Cleanup while one of these worker threads is still active. wtc said that he recommends not calling PR_Cleanup... that it was really only introduced into NSPR to make Purify happy. he also said that NSPR is not designed to be restarted, as XPCOM is. so binding NSPR shutdown to XPCOM shutdown is not advised. at most PR_Cleanup is something that an application embedding gecko might want to call. but, even still... that application would not know when it is safe to call PR_Cleanup. for our purposes, i think it should never be called. btw: the cache also spins up a detached thread when it needs to discard an old cache directory (that has been determined to be corrupt), so we'd be in trouble if we called PR_Cleanup today.
Comment 44•21 years ago
|
||
I now understand that my patch, by eliminating the main DNS thread, requires unjoinable threads simply because there's nowhere to join them. Reinstating the main "dispatcher" thread would remove that requirement if necessary. Using getaddrinfo() would be great, and my code could be used as-is by substituting in PR_getaddrinfo() or whatever. However I'd consider keeping the true async DNS implementations where available, such as under XP_WIN and XP_MAC, as they don't require spawning who knows how many threads at once (think groupmarks). Now, ummm... if I finish this, what are the odds it'll make it into the code? Is it worth it?
Comment 45•21 years ago
|
||
i don't know where to put this question, so i'm dumping it here: you said the parallel approach would have a limited number of threads. what happens if all of them get stuck? :-) [No, i'm not suggesting an unlimited number of threads, I know there are os limits and that mozilla would be *very* unhappy if it couldn't get threads for other important things. i'm just playing devil's advocate]
Comment 46•21 years ago
|
||
Updated patch, fixes a few problems with the original. As before, there is no main long-lived DNS thread. The worker threads are unjoinable and there is in fact *no* limit the number of threads that can be launched simultaneously. I guess I'll fix that next, but due to the great unlikeliness of running out of threads, this patch should be perfectly fine for testing in the meantime.
Attachment #124699 -
Attachment is obsolete: true
Attachment #124700 -
Attachment is obsolete: true
Comment 47•21 years ago
|
||
I'm nearly done with my patch, just testing for leaks at the moment. In fact I think I've found a pre-existing one; can someone with a better knowledge of the codebase confirm or deny? At line 1520 of nsDnsService.cpp, in function nsDNSService::Lookup(): NS_ADDREF(request); // for caller The request object in question is returned via a pointer argument passed to the function and the comment seems to imply that the caller should call the corresponding RELEASE(); as well, it doesn't seem to happen with the DNS code itself. However in the code that use the DNS service (nsSocketTransport2.cpp and nsLDAPConnection.cpp) I can't find the RELEASE() either. valgrind doesn't like mozilla all at once, but it's reporting the request object leaked when the DNS service is used from a small test program. So am I missing something or is this really a leak? For the moment I'll take that ADDREF() out of the code I'm working on and see if it causes any problems.
Assignee | ||
Comment 48•21 years ago
|
||
*** Bug 213160 has been marked as a duplicate of this bug. ***
Assignee | ||
Comment 50•21 years ago
|
||
this bug is fixed by the patch for bug 205726.
Status: NEW → RESOLVED
Closed: 21 years ago
Resolution: --- → FIXED
Target Milestone: mozilla1.4alpha → mozilla1.6alpha
Comment 51•21 years ago
|
||
This is regarding comment 28. Now that this bug is fixed, I can send the $100 I pledged. Now that Mozilla.org is a foundation, would any of the involved developers mind if I just donated to Mozilla.org?
Assignee | ||
Comment 52•21 years ago
|
||
ben: yes, please donate your money to the mozilla foundation instead! ;-)
Comment 53•21 years ago
|
||
Thanks for solving this. I'll send my promised $20 to the foundation as well...
Comment 55•21 years ago
|
||
In which version is it fixed?
Comment 56•21 years ago
|
||
in 1.6Alpha or later builds
Comment 57•20 years ago
|
||
This problem does not appear to be fixed on at least FreeBSD 4.9. First noticed this issue with firefox 0.8 so I tested mozilla 1.6 and 1.7a and mozilla still appears to have the same problems. I ran a packet capture with ethereal while trying to load www.allmusic.com and it shows multiple type AAAA queries being made that ultimately receive a "Server failed" response. Eventually a type A request is made which receives back a good response. This site takes on average almost 4 minutes to load in its entirety. I also found that if I hit stop and then try to go to another site mozilla will continue to try and query the old site for a period of time before trying the new site. A few notes about recreating this on FreeBSD. The resolver in the -CURRENT branch is now different from the one in 4.9 and the -STABLE branch. I have not had a chance to test this bug on -CURRENT. Also, INET6 being enabled in the kernel (default) and net.inet6.ip6.auto_linklocal not being set to 0 in /etc/sysctl.conf will cause the problem to be more evident.
Comment 58•20 years ago
|
||
There's a new bug 233366 that covers an issue that seems related to this one. Please migrate your CC's if you experience similar problems as mentioned there.
You need to log in
before you can comment on or make changes to this bug.
Description
•