Closed Bug 70213 Opened 24 years ago Closed 21 years ago

XP_UNIX DNS lookups are serialized

Categories

(Core :: Networking, defect, P2)

x86
Linux
defect

Tracking

()

VERIFIED FIXED
mozilla1.6alpha

People

(Reporter: darin.moz, Assigned: darin.moz)

References

Details

(Keywords: perf)

Attachments

(2 files, 2 obsolete files)

XP_UNIX DNS lookups are serialized.  Allowing only one DNS lookup to be
performed at a time has serious drawbacks as a "slow lookup" would noticably
block all other lookups, leading to poor perceived performance.

The current implementation results from the fact that NSPR 4.0 does not export
a re-entrant gethostbyname system call.  An upgrade to NSPR 4.1 would solve
this problem, allowing for a thread pool based DNS lookup scheme.

This bug is a follow up to bug 10733.
Setting milestone = Future.
Target Milestone: --- → Future
Blocks: 66404
Blocks: 61683
Blocks: 73392
qa to me.
QA Contact: tever → benc
No longer blocks: 73392
Depends on: 78471
Since NSPR 4.2 has landed, is this now fixed, or is the "rework" noted in 10733
for after NSPR 4.1's async DNS still required for this to be really and truly
fixed the right way?
the NSPR 4.2 landing does not fix this, but it does provide us with a way to fix
this.  the solution on our end would require spawning multiple DNS request
threads (on which we'd call the reentrant PR_GetIPNodeByName).
This should probably be prioritized up some. I know several people, including
myself, who can't use Mozilla effectively because it tends to "lock up" while
loading pages from sites that reference ad.doubleclick.net and other ad-banner
servers. One slow/dead DNS server and Mozilla can sit there for five minutes
before it finally times out and allows one to browse again.

This problem is keeping me from using Mozilla as my primary browser. I realize
there's a lot of other work to do, but parallelizing DNS lookups would greatly
increase performance and perceived stability.

-John
good point, but i'm still not sure that this will make it for mozilla 1.0
I think browser hang bugs are a pretty big deal, so I'm nominating it for 1.0. 
We've hit this when doing LDAP lookups for mailnews, and I'm sure it gets hit in
other places as well.
Keywords: mozilla1.0
*** Bug 137444 has been marked as a duplicate of this bug. ***
Keywords: perf
Bug 106031 and bug 94793 might both also be dupes, are at least commented as
possible ones... (could someone please look after them?)
*** Bug 134687 has been marked as a duplicate of this bug. ***
*** Bug 106031 has been marked as a duplicate of this bug. ***
*** Bug 94793 has been marked as a duplicate of this bug. ***
Here's a patch that implements asynchronous DNS lookups for UNIX that uses the
technique described in comment #4. It works for me, but I'm not much of an
expert on threads so there may be some race issues or whatever.

Note that this only works on platforms where gethostbyname is reentrant. On
other platforms, it will still fire up multiple threads, but only one thread at
a time will enter the critical section in PR_GetIPNodeByName which defeats the
whole point. So this really only works on platforms which define
_PR_HAVE_GETHOST_R or _PR_HAVE_THREADSAFE_GETHOST.

All the code for this change is within the following define:
#if defined(XP_UNIX) && defined(ASYNC_DNS)

I just stuck an '#define ASYNC_DNS 1' at the top of nsDnsService.h as I'm not
really familiar with how to integrate an option like this into the build
system.
peter: thank you for the patch.  i unfortunately do not have time right now to
review this patch as carefully as i should.  it has been on my list of things to
do since you posted it, but i've just been too busy with mozilla 1.0 bugs. 
please don't be discouraged by the lack of response to your patch... we really
appreciate your contribution.  right now, however, it's just not top priority,
and there are a number of more critical bugs stealing everyones attention.
*** Bug 170694 has been marked as a duplicate of this bug. ***
[RFE] is deprecated in favor of severity: enhancement.  They have the same meaning.
Severity: normal → enhancement
This isn't just a small performance issue that could be tweaked, it could be a
very big problem when you have one unresolvable name that you try to resolve
every time. I wrote about this side-effect in Bug 170694. When the only
nameserver responsible for my resolving hostname becomes unreachable, mozilla
adds a lookup for the name on every page load and startup. This means startup
and every page load is delayed from teh start by a timeout (on the order of 20
seconds) thanks to the the first NS request -- for my hostname -- blocking.
Changing Severity back to normal per comment #17.
Severity: enhancement → normal
Summary: [RFE] XP_UNIX DNS lookups are serialized → XP_UNIX DNS lookups are serialized
I'll take this.  It's DNS and I've got fewer bugs than darin...at the moment :-)
Assignee: darin → gordon
Priority: -- → P2
Blocks: 181970
Status: NEW → ASSIGNED
*** Bug 192579 has been marked as a duplicate of this bug. ***
Do we have a revised milestone on this?  This for me really needs to be fixed
soon and is a major blocker for us web developers.
I'd like to get this fixed for 1.4.
Target Milestone: Future → mozilla1.4alpha
  This bug is driving me insane.  I used to notice this back around milestone 13
on FreeBSD 3.X.  Somewhere in there it disappeared for about a year on my BSD
machines.. now with FreBSD 5.0 and mozilla 1.3b it's back.  Sites such as
cnn.com or nhl.com can take on the order of minutes to render.  Usually with
"resolving ad.doubleclick.net" in the status bar.  Some other sites have
problems with resolving other DNS records.  If it helps any, I noticed that
under FreeBSD 5.0 it can take 20 seconds to resolve each hostname.  The A record
pops up right away, but the MX record takes 20 seconds (even if there is no MX
record the resolver will hang for 20 seconds).
  Well, I temporarily solved my problem by turning off IPv6.. which is related
to another bug with FreeBSD/mozilla that results in "connection refused"
messages (that bug is already filed).  My ISP's nameservers were ignoring IPv6
(AAAA) queries.  But I'd still prefer if all of my tabs didn't hang when the
browser is stuck resolving one address.
gordon: is this going to be hard to fix? if you cannot fix this for 1.4, can you
think about fixing the bugs that get jammed up because we are serialized?
The last day or two have really brought home why this is sorely needed, since
ad.doubleclick.net has been timing out for me, causing many pages to take over a
minute to render due to the long delays while the lookups time out!

A candidate patch was posted almost a year ago, can't this be moved higher up
the list?  This is alpha time, its unlikely a patch like this would be approved
during beta time so otherwise its at least three more months before this gets fixed.

There are a pretty fair number of votes for a Unix-only bug, I'm not the only
one who has been waiting forever on this (no pun intended...)
Unfortunately that patch that I did has some problems. I didn't really
understand what I was doing and the patch as written causes segfaults. I did fix
those, but at the cost of memory leaks so never stuck that patch in there.
Besides, I think the fix is sort of wrong anyway. Instead of threads I think it
would be better to use async DNS or something similar, but I think that would be
a lot more work. Either way, I have no time to try and fix this problem.

I agree with a lot of what other people have said. This is a major bug as far as
I'm concerned. It really wrecks how well Mozilla works on platforms with IPv6
enabled.
This bug has caused me to restart Mozilla countless times over the _years_.
Obviously, votes aren't working, and I do not have the skills to fix this. I
will contribute a dollar for every vote for this bug up to $100 USD to the
person or persons who are major parts of the effort to close this bug, when this
bug is closed. Perhaps we need a "money" keyword?
it's a particularly big bug because mozilla would _feel_ much faster with
parallel NS lookups, ignoring for the moment the dealbreaker problems it would
also solve. maybe most people think the time hit isn't too big, but consider
some sites where there are dozens or hundreds of images inlined from different
sources... it adds up. though it looks like someone has bumped the target to the
next release (1.4a), so ben ya might only be out about 40 bucks..
This one is driving me nuts as well. If this is still valid for the new Mozilla
Firebird architecture, I'll throw another 20$ in for this being solved...
Ok, I'm going to take a shot at implementing this, as it has been a PITA for me
since forever.  This will be my first shot at hacking Moz, so we'll see how it
goes.  
As far as I can tell the code involved is low-level enough that the Firebird
change won't affect it at all (someone please tell me if I'm wrong on this count
before I waste a lot of time)
Wish me luck...
This should do as a starting point, at least.  It's based heavily on the
previous patch by Peter Haight; the only major difference is that I've
eliminated the "main" DNS thread as it's no longer necessary.  Each DNS lookup
now runs in its own thread and thus in parallel without one blocking the other.
  

At compile time the changes depend upon:
 #if defined(XP_UNIX) && defined(UNIX_ASYNC_DNS)
UNIX_ASYNC_DNS is defined by the configure script on *all* Unix builds.  This
needs refining as there are versions without a reentrant gethostbyname() call. 
(FreeBSD is one, I think)

I've been using it for a few hours and no problems to report yet (the only
thing I noticed is that the app will hang on shutdown until any pending lookups
are complete, but it does that without the patch anyways).  More testing is
needed, obviously, but I'd like to get some more eyes on this as well, so here
it is.	

Finally, if anyone wants to test this but can't or won't make a complete build,
I can provide a copy of my libnecko.so (Linux/Intel, gcc 2.95.3, glibc 2.2.3).
This should do as a starting point, at least.  It's based heavily on the
previous patch by Peter Haight; the only major difference is that I've
eliminated the "main" DNS thread as it's no longer necessary.  Each DNS lookup
now runs in its own thread and thus in parallel without one blocking the other.
  

At compile time the changes depend upon:
 #if defined(XP_UNIX) && defined(UNIX_ASYNC_DNS)
UNIX_ASYNC_DNS is defined by the configure script on *all* Unix builds.  This
needs refining as there are versions without a reentrant gethostbyname() call. 
(FreeBSD is one, I think)

I've been using it for a few hours and no problems to report yet (the only
thing I noticed is that the app will hang on shutdown until any pending lookups
are complete, but it does that without the patch anyways).  More testing is
needed, obviously, but I'd like to get some more eyes on this as well, so here
it is.	

Finally, if anyone wants to test this but can't or won't make a complete build,
I can provide a copy of my libnecko.so (Linux/Intel, gcc 2.95.3, glibc 2.2.3).
Blocks: majorbugs
OpenVMS doesn't have reentrant DNS routines, so I need to remove UNIX_ASYNC_DNS.
See bug 208457.
Hmmm, major leakage in my first patch.  Update coming soon.  (sorry for the
double comment, by the way)
Correct.  FreeBSD does _NOT_ have a gethostbyname_r().  The symbol in libc_r is
merely the normal one.

This is the case from 2.2.x up to and including 4.x.  I am almost sure 5.x is
still not reentrant as well.
AFAIK, Mac OS X doesn't support gethostbyname_r() either. Bummer.
That's ok, Macs have something better: OTInetStringToAddress(), a proper
asynchronous DNS call.  Don't know what part of the system it's implemented in,
but Moz doesn't use anything like gethostbyname() on XP_MAC.
Ummm, OS X *is* compiled with XP_MAC and not XP_UNIX right?
nope :)

defined XP_UNIX && defined XP_MACOSX
For systems without a re-entrant gethostbyname(), how about using getaddrinfo()?
 FreeBSD 4.x has it.
i spoke with wtc yesterday concerning this bug (and others related to it), and
we came up with a way to expose getaddrinfo via NSPR.  a good number of
platforms (including WinXP) support getaddrinfo.  OSX also supports it. 
moreover, we spoke about using detachable threads for the DNS queries.  that is
going to be required in order to avoid shutdown hangs.  however, it means that
we could get into trouble if the primordial thread ever calls PR_Cleanup (which
it currently does not).

timeless: i seem to recall seeing a patch about adding PR_Cleanup to XPCOM
shutdown.  we might need to WONTFIX that bug.

steve: thanks for the patch.  at this point i'm strongly considering a full
rewrite of the DNS service; however, that may be too ambitious, in which case
something like what you've put together may be exactly the start of what we need.

bug 205726 has many details about the changes i want to make to the DNS service.
Depends on: 205726
the windows dns service can hang if you're unlucky enough to use a debugger near
shutdown. It's a clear case of mixing two incompatible message passing systems
(windows, nspr) - the windows message can be lost while running under the
debugger, and then nspr waits expectfully for the thread to finish while the
thread continues to wait for the already lost message.

believe me when i say i'll be unhappy if we wontfix the pr_cleanup bug.

i'm still unclear about the precise problem which caused gordon to suggest
unjoinable threads. i'm also unsure as to why it would be bad to call pr_cleanup
with those unjoinable threads running around. we're quiting, all data is gone.
what's the worst thing that happens? we crash?
timeless: in order to support IPv6 under windows XP, we must call getaddrinfo. 
in fact, microsoft suggests that applications use getaddrinfo in place of any
other mechanism.  that means, we will be doing away with the existing WINSOCK
based DNS solution.  moreover, the idea is to have one implementation for all
platforms.  on platforms which support getaddrinfo (or at least one of the other
re-entrant forms of gethostbyname) we will spawn several threads so that a slow
to resolve DNS query does not completely block the browser from visiting other
sites.  that is the problem we have today with the current XP_UNIX impl, and
indeed it is what this bug is about.  also, we hit the same problem at shutdown
when we attempt to join with the DNS thread.  if that DNS thread has not
returned from gethostbyname (or equivalent function), then the UI thread will
hang waiting for the join operation to complete.  this hang can last for a long
time (minutes!)... such that users usually think moz has died.  the proposed
solution is to use detached threads.  each will own a reference to the DNS
service so they can do what they need to do after "gethostbyname" returns, but
the UI thread will not need to join with these threads before itself going away.
 the problem of course is that NSPR has many globally allocated resources that
may be in use by these worker threads.  we may certainly crash if the UI thread
calls PR_Cleanup while one of these worker threads is still active.  wtc said
that he recommends not calling PR_Cleanup... that it was really only introduced
into NSPR to make Purify happy.  he also said that NSPR is not designed to be
restarted, as XPCOM is.  so binding NSPR shutdown to XPCOM shutdown is not
advised.  at most PR_Cleanup is something that an application embedding gecko
might want to call.  but, even still... that application would not know when it
is safe to call PR_Cleanup.  for our purposes, i think it should never be
called.  btw: the cache also spins up a detached thread when it needs to discard
an old cache directory (that has been determined to be corrupt), so we'd be in
trouble if we called PR_Cleanup today.
I now understand that my patch, by eliminating the main DNS thread, requires
unjoinable threads simply because there's nowhere to join them.  Reinstating the
main "dispatcher" thread would remove that requirement if necessary.

Using getaddrinfo() would be great, and my code could be used as-is by
substituting in PR_getaddrinfo() or whatever.  However I'd consider keeping the
true async DNS implementations where available, such as under XP_WIN and XP_MAC,
as they don't require spawning who knows how many threads at once (think
groupmarks).

Now, ummm...  if I finish this, what are the odds it'll make it into the code? 
Is it worth it?
i don't know where to put this question, so i'm dumping it here:

you said the parallel approach would have a limited number of threads. what
happens if all of them get stuck? :-)

[No, i'm not suggesting an unlimited number of threads, I know there are os
limits and that mozilla would be *very* unhappy if it couldn't get threads for
other important things. i'm just playing devil's advocate]
Updated patch, fixes a few problems with the original.	As before, there is no
main long-lived DNS thread.  The worker threads are unjoinable and there is in
fact *no* limit the number of threads that can be launched simultaneously.  I
guess I'll fix that next, but due to the great unlikeliness of running out of
threads, this patch should be perfectly fine for testing in the meantime.
Attachment #124699 - Attachment is obsolete: true
Attachment #124700 - Attachment is obsolete: true
Blocks: 154816
I'm nearly done with my patch, just testing for leaks at the moment.

In fact I think I've found a pre-existing one; can someone with a better
knowledge of the codebase confirm or deny?

At line 1520 of nsDnsService.cpp, in function nsDNSService::Lookup():

        NS_ADDREF(request); // for caller

The request object in question is returned via a pointer argument passed to the
function and the comment seems to imply that the caller should call the
corresponding RELEASE(); as well, it doesn't seem to happen with the DNS code
itself.  However in the code that use the DNS service (nsSocketTransport2.cpp
and nsLDAPConnection.cpp) I can't find the RELEASE() either.  valgrind doesn't
like mozilla all at once, but it's reporting the request object leaked when the
DNS service is used from a small test program.

So am I missing something or is this really a leak?  For the moment I'll take
that ADDREF() out of the code I'm working on and see if it causes any problems.
*** Bug 213160 has been marked as a duplicate of this bug. ***
-> me
Assignee: gordon → darin
Status: ASSIGNED → NEW
this bug is fixed by the patch for bug 205726.
Status: NEW → RESOLVED
Closed: 21 years ago
Resolution: --- → FIXED
Target Milestone: mozilla1.4alpha → mozilla1.6alpha
This is regarding comment 28.

Now that this bug is fixed, I can send the $100 I pledged. Now that Mozilla.org
is a foundation, would any of the involved developers mind if I just donated to
Mozilla.org?
ben: yes, please donate your money to the mozilla foundation instead! ;-)
Thanks for solving this. I'll send my promised $20 to the foundation as well...
VERIFIED/fixed
per darin.
Status: RESOLVED → VERIFIED
In which version is it fixed?
in 1.6Alpha or later builds
This problem does not appear to be fixed on at least FreeBSD 4.9.  First noticed
this issue with firefox 0.8 so I tested mozilla 1.6 and 1.7a and mozilla still
appears to have the same problems.  I ran a packet capture with ethereal while
trying to load www.allmusic.com and it shows multiple type AAAA queries being
made that ultimately receive a "Server failed" response.  Eventually a type A
request is made which receives back a good response.  This site takes on average
almost 4 minutes to load in its entirety.  I also found that if I hit stop and
then try to go to another site mozilla will continue to try and query the old
site for a period of time before trying the new site.

A few notes about recreating this on FreeBSD.  The resolver in the -CURRENT
branch is now different from the one in 4.9 and the -STABLE branch.  I have  not
had a chance to test this bug on -CURRENT.  Also, INET6 being enabled in the
kernel (default) and net.inet6.ip6.auto_linklocal not being set to 0 in
/etc/sysctl.conf will cause the problem to be more evident.
There's a new bug 233366 that covers an issue that seems related to this one. 

Please migrate your CC's if you experience similar problems as mentioned there.
No longer blocks: majorbugs
You need to log in before you can comment on or make changes to this bug.

Attachment

General

Creator:
Created:
Updated:
Size: