70213 - XP_UNIX DNS lookups are serialized

Assignee

Description

•

25 years ago

XP_UNIX DNS lookups are serialized. Allowing only one DNS lookup to be performed at a time has serious drawbacks as a "slow lookup" would noticably block all other lookups, leading to poor perceived performance. The current implementation results from the fact that NSPR 4.0 does not export a re-entrant gethostbyname system call. An upgrade to NSPR 4.1 would solve this problem, allowing for a thread pool based DNS lookup scheme. This bug is a follow up to bug 10733.

Darin Fisher

Assignee

Comment 1

•

25 years ago

Setting milestone = Future.

Target Milestone: --- → Future

Darin Fisher

Assignee

Updated

•

25 years ago

Blocks: 66404

Darin Fisher

Assignee

Updated

•

25 years ago

Blocks: 61683

Boris Zbarsky [:bzbarsky]

Updated

•

24 years ago

Blocks: 73392

benc

Comment 2

•

24 years ago

qa to me.

QA Contact: tever → benc

Darin Fisher

Assignee

Updated

•

24 years ago

No longer blocks: 73392

Steve Freeland

Updated

•

24 years ago

Depends on: 78471

slice1900

Comment 3

•

24 years ago

Since NSPR 4.2 has landed, is this now fixed, or is the "rework" noted in 10733 for after NSPR 4.1's async DNS still required for this to be really and truly fixed the right way?

Darin Fisher

Assignee

Comment 4

•

24 years ago

the NSPR 4.2 landing does not fix this, but it does provide us with a way to fix this. the solution on our end would require spawning multiple DNS request threads (on which we'd call the reentrant PR_GetIPNodeByName).

John Flynn

Comment 5

•

24 years ago

This should probably be prioritized up some. I know several people, including myself, who can't use Mozilla effectively because it tends to "lock up" while loading pages from sites that reference ad.doubleclick.net and other ad-banner servers. One slow/dead DNS server and Mozilla can sit there for five minutes before it finally times out and allows one to browse again. This problem is keeping me from using Mozilla as my primary browser. I realize there's a lot of other work to do, but parallelizing DNS lookups would greatly increase performance and perceived stability. -John

Darin Fisher

Assignee

Comment 6

•

24 years ago

good point, but i'm still not sure that this will make it for mozilla 1.0

Dan Mosedale (:dmosedale, :dmose)

Comment 7

•

24 years ago

I think browser hang bugs are a pretty big deal, so I'm nominating it for 1.0. We've hit this when doing LDAP lookups for mailnews, and I'm sure it gets hit in other places as well.

Keywords: mozilla1.0

Boris Zbarsky [:bzbarsky]

Comment 8

•

23 years ago

*** Bug 137444 has been marked as a duplicate of this bug. ***

Adam Hauner

Updated

•

23 years ago

Keywords: perf

Andreas Kunz

Comment 9

•

23 years ago

Bug 106031 and bug 94793 might both also be dupes, are at least commented as possible ones... (could someone please look after them?)

John Levon

Comment 10

•

23 years ago

*** Bug 134687 has been marked as a duplicate of this bug. ***

Dan Mosedale (:dmosedale, :dmose)

Comment 11

•

23 years ago

*** Bug 106031 has been marked as a duplicate of this bug. ***

Dan Mosedale (:dmosedale, :dmose)

Comment 12

•

23 years ago

*** Bug 94793 has been marked as a duplicate of this bug. ***

Peter Haight

Comment 13

•

23 years ago

Attached patch Asynchronous DNS lookups for UNIX. — Details — Splinter Review

Here's a patch that implements asynchronous DNS lookups for UNIX that uses the technique described in comment #4. It works for me, but I'm not much of an expert on threads so there may be some race issues or whatever. Note that this only works on platforms where gethostbyname is reentrant. On other platforms, it will still fire up multiple threads, but only one thread at a time will enter the critical section in PR_GetIPNodeByName which defeats the whole point. So this really only works on platforms which define _PR_HAVE_GETHOST_R or _PR_HAVE_THREADSAFE_GETHOST. All the code for this change is within the following define: #if defined(XP_UNIX) && defined(ASYNC_DNS) I just stuck an '#define ASYNC_DNS 1' at the top of nsDnsService.h as I'm not really familiar with how to integrate an option like this into the build system.

Darin Fisher

Assignee

Comment 14

•

23 years ago

peter: thank you for the patch. i unfortunately do not have time right now to review this patch as carefully as i should. it has been on my list of things to do since you posted it, but i've just been too busy with mozilla 1.0 bugs. please don't be discouraged by the lack of response to your patch... we really appreciate your contribution. right now, however, it's just not top priority, and there are a number of more critical bugs stealing everyones attention.

Jo Hermans

Comment 15

•

23 years ago

*** Bug 170694 has been marked as a duplicate of this bug. ***

Brant Gurganus

Comment 16

•

23 years ago

[RFE] is deprecated in favor of severity: enhancement. They have the same meaning.

Severity: normal → enhancement

mike

Comment 17

•

23 years ago

This isn't just a small performance issue that could be tweaked, it could be a very big problem when you have one unresolvable name that you try to resolve every time. I wrote about this side-effect in Bug 170694. When the only nameserver responsible for my resolving hostname becomes unreachable, mozilla adds a lookup for the name on every page load and startup. This means startup and every page load is delayed from teh start by a timeout (on the order of 20 seconds) thanks to the the first NS request -- for my hostname -- blocking.

Aleksey Nogin

Comment 18

•

23 years ago

Changing Severity back to normal per comment #17.

Severity: enhancement → normal

Sören 'Chucker' Kuklau (gone)

Updated

•

23 years ago

Summary: [RFE] XP_UNIX DNS lookups are serialized → XP_UNIX DNS lookups are serialized

gordon

Comment 19

•

23 years ago

I'll take this. It's DNS and I've got fewer bugs than darin...at the moment :-)

Assignee: darin → gordon

Priority: -- → P2

Darin Fisher

Assignee

Updated

•

23 years ago

Blocks: 181970

gordon

Updated

•

23 years ago

Status: NEW → ASSIGNED

Boris Zbarsky [:bzbarsky]

Comment 20

•

23 years ago

*** Bug 192579 has been marked as a duplicate of this bug. ***

Kelly Price

Comment 21

•

23 years ago

Do we have a revised milestone on this? This for me really needs to be fixed soon and is a major blocker for us web developers.

gordon

Comment 22

•

23 years ago

I'd like to get this fixed for 1.4.

Target Milestone: Future → mozilla1.4alpha

Brian

Comment 23

•

23 years ago

This bug is driving me insane. I used to notice this back around milestone 13 on FreeBSD 3.X. Somewhere in there it disappeared for about a year on my BSD machines.. now with FreBSD 5.0 and mozilla 1.3b it's back. Sites such as cnn.com or nhl.com can take on the order of minutes to render. Usually with "resolving ad.doubleclick.net" in the status bar. Some other sites have problems with resolving other DNS records. If it helps any, I noticed that under FreeBSD 5.0 it can take 20 seconds to resolve each hostname. The A record pops up right away, but the MX record takes 20 seconds (even if there is no MX record the resolver will hang for 20 seconds).

Brian

Comment 24

•

23 years ago

Well, I temporarily solved my problem by turning off IPv6.. which is related to another bug with FreeBSD/mozilla that results in "connection refused" messages (that bug is already filed). My ISP's nameservers were ignoring IPv6 (AAAA) queries. But I'd still prefer if all of my tabs didn't hang when the browser is stuck resolving one address.

benc

Comment 25

•

23 years ago

gordon: is this going to be hard to fix? if you cannot fix this for 1.4, can you think about fixing the bugs that get jammed up because we are serialized?

slice1900

Comment 26

•

22 years ago

The last day or two have really brought home why this is sorely needed, since ad.doubleclick.net has been timing out for me, causing many pages to take over a minute to render due to the long delays while the lookups time out! A candidate patch was posted almost a year ago, can't this be moved higher up the list? This is alpha time, its unlikely a patch like this would be approved during beta time so otherwise its at least three more months before this gets fixed. There are a pretty fair number of votes for a Unix-only bug, I'm not the only one who has been waiting forever on this (no pun intended...)

Peter Haight

Comment 27

•

22 years ago

Unfortunately that patch that I did has some problems. I didn't really understand what I was doing and the patch as written causes segfaults. I did fix those, but at the cost of memory leaks so never stuck that patch in there. Besides, I think the fix is sort of wrong anyway. Instead of threads I think it would be better to use async DNS or something similar, but I think that would be a lot more work. Either way, I have no time to try and fix this problem. I agree with a lot of what other people have said. This is a major bug as far as I'm concerned. It really wrecks how well Mozilla works on platforms with IPv6 enabled.

Ben Woolley

Comment 28

•

22 years ago

This bug has caused me to restart Mozilla countless times over the _years_. Obviously, votes aren't working, and I do not have the skills to fix this. I will contribute a dollar for every vote for this bug up to $100 USD to the person or persons who are major parts of the effort to close this bug, when this bug is closed. Perhaps we need a "money" keyword?

mike

Comment 29

•

22 years ago

it's a particularly big bug because mozilla would _feel_ much faster with parallel NS lookups, ignoring for the moment the dealbreaker problems it would also solve. maybe most people think the time hit isn't too big, but consider some sites where there are dozens or hundreds of images inlined from different sources... it adds up. though it looks like someone has bumped the target to the next release (1.4a), so ben ya might only be out about 40 bucks..

Sebastian Späth

Comment 30

•

22 years ago

This one is driving me nuts as well. If this is still valid for the new Mozilla Firebird architecture, I'll throw another 20$ in for this being solved...

Steve Freeland

Comment 31

•

22 years ago

Ok, I'm going to take a shot at implementing this, as it has been a PITA for me since forever. This will be my first shot at hacking Moz, so we'll see how it goes. As far as I can tell the code involved is low-level enough that the Firebird change won't affect it at all (someone please tell me if I'm wrong on this count before I waste a lot of time) Wish me luck...

Steve Freeland

Comment 32

•

22 years ago

Attached patch Preliminary patch: Parallelized DNS lookups for XP_UNIX (obsolete) — Details — Splinter Review

This should do as a starting point, at least. It's based heavily on the previous patch by Peter Haight; the only major difference is that I've eliminated the "main" DNS thread as it's no longer necessary. Each DNS lookup now runs in its own thread and thus in parallel without one blocking the other. At compile time the changes depend upon: #if defined(XP_UNIX) && defined(UNIX_ASYNC_DNS) UNIX_ASYNC_DNS is defined by the configure script on *all* Unix builds. This needs refining as there are versions without a reentrant gethostbyname() call. (FreeBSD is one, I think) I've been using it for a few hours and no problems to report yet (the only thing I noticed is that the app will hang on shutdown until any pending lookups are complete, but it does that without the patch anyways). More testing is needed, obviously, but I'd like to get some more eyes on this as well, so here it is. Finally, if anyone wants to test this but can't or won't make a complete build, I can provide a copy of my libnecko.so (Linux/Intel, gcc 2.95.3, glibc 2.2.3).

Steve Freeland

Comment 33

•

22 years ago

Attached patch Preliminary patch: Parallelized DNS lookups for XP_UNIX (obsolete) — Details — Splinter Review

This should do as a starting point, at least. It's based heavily on the previous patch by Peter Haight; the only major difference is that I've eliminated the "main" DNS thread as it's no longer necessary. Each DNS lookup now runs in its own thread and thus in parallel without one blocking the other. At compile time the changes depend upon: #if defined(XP_UNIX) && defined(UNIX_ASYNC_DNS) UNIX_ASYNC_DNS is defined by the configure script on *all* Unix builds. This needs refining as there are versions without a reentrant gethostbyname() call. (FreeBSD is one, I think) I've been using it for a few hours and no problems to report yet (the only thing I noticed is that the app will hang on shutdown until any pending lookups are complete, but it does that without the patch anyways). More testing is needed, obviously, but I'd like to get some more eyes on this as well, so here it is. Finally, if anyone wants to test this but can't or won't make a complete build, I can provide a copy of my libnecko.so (Linux/Intel, gcc 2.95.3, glibc 2.2.3).

Michael Lefevre

Updated

•

22 years ago

Blocks: majorbugs

Colin Blake

Comment 34

•

22 years ago

OpenVMS doesn't have reentrant DNS routines, so I need to remove UNIX_ASYNC_DNS. See bug 208457.

Steve Freeland

Comment 35

•

22 years ago

Hmmm, major leakage in my first patch. Update coming soon. (sorry for the double comment, by the way)

Jeroen Ruigrok van der Werven

Comment 36

•

22 years ago

Correct. FreeBSD does _NOT_ have a gethostbyname_r(). The symbol in libc_r is merely the normal one. This is the case from 2.2.x up to and including 4.x. I am almost sure 5.x is still not reentrant as well.

Jo Hermans

Comment 37

•

22 years ago

AFAIK, Mac OS X doesn't support gethostbyname_r() either. Bummer.

Steve Freeland

Comment 38

•

22 years ago

That's ok, Macs have something better: OTInetStringToAddress(), a proper asynchronous DNS call. Don't know what part of the system it's implemented in, but Moz doesn't use anything like gethostbyname() on XP_MAC. Ummm, OS X *is* compiled with XP_MAC and not XP_UNIX right?

timeless

Comment 39

•

22 years ago

nope :) defined XP_UNIX && defined XP_MACOSX

Sean Farley

Comment 40

•

22 years ago

For systems without a re-entrant gethostbyname(), how about using getaddrinfo()? FreeBSD 4.x has it.

Darin Fisher

Assignee

Comment 41

•

22 years ago

i spoke with wtc yesterday concerning this bug (and others related to it), and we came up with a way to expose getaddrinfo via NSPR. a good number of platforms (including WinXP) support getaddrinfo. OSX also supports it. moreover, we spoke about using detachable threads for the DNS queries. that is going to be required in order to avoid shutdown hangs. however, it means that we could get into trouble if the primordial thread ever calls PR_Cleanup (which it currently does not). timeless: i seem to recall seeing a patch about adding PR_Cleanup to XPCOM shutdown. we might need to WONTFIX that bug. steve: thanks for the patch. at this point i'm strongly considering a full rewrite of the DNS service; however, that may be too ambitious, in which case something like what you've put together may be exactly the start of what we need. bug 205726 has many details about the changes i want to make to the DNS service.

Depends on: 205726

timeless

Comment 42

•

22 years ago

the windows dns service can hang if you're unlucky enough to use a debugger near shutdown. It's a clear case of mixing two incompatible message passing systems (windows, nspr) - the windows message can be lost while running under the debugger, and then nspr waits expectfully for the thread to finish while the thread continues to wait for the already lost message. believe me when i say i'll be unhappy if we wontfix the pr_cleanup bug. i'm still unclear about the precise problem which caused gordon to suggest unjoinable threads. i'm also unsure as to why it would be bad to call pr_cleanup with those unjoinable threads running around. we're quiting, all data is gone. what's the worst thing that happens? we crash?

Darin Fisher

Assignee

Comment 43

•

22 years ago

timeless: in order to support IPv6 under windows XP, we must call getaddrinfo. in fact, microsoft suggests that applications use getaddrinfo in place of any other mechanism. that means, we will be doing away with the existing WINSOCK based DNS solution. moreover, the idea is to have one implementation for all platforms. on platforms which support getaddrinfo (or at least one of the other re-entrant forms of gethostbyname) we will spawn several threads so that a slow to resolve DNS query does not completely block the browser from visiting other sites. that is the problem we have today with the current XP_UNIX impl, and indeed it is what this bug is about. also, we hit the same problem at shutdown when we attempt to join with the DNS thread. if that DNS thread has not returned from gethostbyname (or equivalent function), then the UI thread will hang waiting for the join operation to complete. this hang can last for a long time (minutes!)... such that users usually think moz has died. the proposed solution is to use detached threads. each will own a reference to the DNS service so they can do what they need to do after "gethostbyname" returns, but the UI thread will not need to join with these threads before itself going away. the problem of course is that NSPR has many globally allocated resources that may be in use by these worker threads. we may certainly crash if the UI thread calls PR_Cleanup while one of these worker threads is still active. wtc said that he recommends not calling PR_Cleanup... that it was really only introduced into NSPR to make Purify happy. he also said that NSPR is not designed to be restarted, as XPCOM is. so binding NSPR shutdown to XPCOM shutdown is not advised. at most PR_Cleanup is something that an application embedding gecko might want to call. but, even still... that application would not know when it is safe to call PR_Cleanup. for our purposes, i think it should never be called. btw: the cache also spins up a detached thread when it needs to discard an old cache directory (that has been determined to be corrupt), so we'd be in trouble if we called PR_Cleanup today.

Steve Freeland

Comment 44

•

22 years ago

I now understand that my patch, by eliminating the main DNS thread, requires unjoinable threads simply because there's nowhere to join them. Reinstating the main "dispatcher" thread would remove that requirement if necessary. Using getaddrinfo() would be great, and my code could be used as-is by substituting in PR_getaddrinfo() or whatever. However I'd consider keeping the true async DNS implementations where available, such as under XP_WIN and XP_MAC, as they don't require spawning who knows how many threads at once (think groupmarks). Now, ummm... if I finish this, what are the odds it'll make it into the code? Is it worth it?

timeless

Comment 45

•

22 years ago

i don't know where to put this question, so i'm dumping it here: you said the parallel approach would have a limited number of threads. what happens if all of them get stuck? :-) [No, i'm not suggesting an unlimited number of threads, I know there are os limits and that mozilla would be *very* unhappy if it couldn't get threads for other important things. i'm just playing devil's advocate]

Steve Freeland

Comment 46

•

22 years ago

Attached patch Parallelized DNS lookups patch, attempt #2 — Details — Splinter Review

Updated patch, fixes a few problems with the original. As before, there is no main long-lived DNS thread. The worker threads are unjoinable and there is in fact *no* limit the number of threads that can be launched simultaneously. I guess I'll fix that next, but due to the great unlikeliness of running out of threads, this patch should be perfectly fine for testing in the meantime.

Attachment #124699 - Attachment is obsolete: true

Attachment #124700 - Attachment is obsolete: true

Jo Hermans

Updated

•

22 years ago

Blocks: 154816

Steve Freeland

Comment 47

•

22 years ago

I'm nearly done with my patch, just testing for leaks at the moment. In fact I think I've found a pre-existing one; can someone with a better knowledge of the codebase confirm or deny? At line 1520 of nsDnsService.cpp, in function nsDNSService::Lookup(): NS_ADDREF(request); // for caller The request object in question is returned via a pointer argument passed to the function and the comment seems to imply that the caller should call the corresponding RELEASE(); as well, it doesn't seem to happen with the DNS code itself. However in the code that use the DNS service (nsSocketTransport2.cpp and nsLDAPConnection.cpp) I can't find the RELEASE() either. valgrind doesn't like mozilla all at once, but it's reporting the request object leaked when the DNS service is used from a small test program. So am I missing something or is this really a leak? For the moment I'll take that ADDREF() out of the code I'm working on and see if it causes any problems.

Darin Fisher

Assignee

Comment 48

•

22 years ago

*** Bug 213160 has been marked as a duplicate of this bug. ***

Darin Fisher

Assignee

Comment 49

•

22 years ago

-> me

Assignee: gordon → darin

Status: ASSIGNED → NEW

Darin Fisher

Assignee

Comment 50

•

22 years ago

this bug is fixed by the patch for bug 205726.

Status: NEW → RESOLVED

Closed: 22 years ago

Resolution: --- → FIXED

Target Milestone: mozilla1.4alpha → mozilla1.6alpha

Ben Woolley

Comment 51

•

22 years ago

This is regarding comment 28. Now that this bug is fixed, I can send the $100 I pledged. Now that Mozilla.org is a foundation, would any of the involved developers mind if I just donated to Mozilla.org?

Darin Fisher

Assignee

Comment 52

•

22 years ago

ben: yes, please donate your money to the mozilla foundation instead! ;-)

Sebastian Späth

Comment 53

•

22 years ago

Thanks for solving this. I'll send my promised $20 to the foundation as well...

benc

Comment 54

•

22 years ago

VERIFIED/fixed per darin.

Status: RESOLVED → VERIFIED

Mats Larsson

Comment 55

•

22 years ago

In which version is it fixed?

Matthias Versen [:Matti]

Comment 56

•

22 years ago

in 1.6Alpha or later builds

Tom McLaughlin

Comment 57

•

21 years ago

This problem does not appear to be fixed on at least FreeBSD 4.9. First noticed this issue with firefox 0.8 so I tested mozilla 1.6 and 1.7a and mozilla still appears to have the same problems. I ran a packet capture with ethereal while trying to load www.allmusic.com and it shows multiple type AAAA queries being made that ultimately receive a "Server failed" response. Eventually a type A request is made which receives back a good response. This site takes on average almost 4 minutes to load in its entirety. I also found that if I hit stop and then try to go to another site mozilla will continue to try and query the old site for a period of time before trying the new site. A few notes about recreating this on FreeBSD. The resolver in the -CURRENT branch is now different from the one in 4.9 and the -STABLE branch. I have not had a chance to test this bug on -CURRENT. Also, INET6 being enabled in the kernel (default) and net.inet6.ip6.auto_linklocal not being set to 0 in /etc/sysctl.conf will cause the problem to be more evident.

Aleksander Adamowski

Comment 58

•

21 years ago

There's a new bug 233366 that covers an issue that seems related to this one. Please migrate your CC's if you experience similar problems as mentioned there.

Jerry Baker

Updated

•

20 years ago

No longer blocks: majorbugs

Asynchronous DNS lookups for UNIX. 23 years ago Peter Haight 7.17 KB, patch		Details \| Diff \| Splinter Review
Preliminary patch: Parallelized DNS lookups for XP_UNIX 22 years ago Steve Freeland 10.00 KB, patch		Details \| Diff \| Splinter Review
Preliminary patch: Parallelized DNS lookups for XP_UNIX 22 years ago Steve Freeland 10.00 KB, patch		Details \| Diff \| Splinter Review
Parallelized DNS lookups patch, attempt #2 22 years ago Steve Freeland 11.16 KB, patch		Details \| Diff \| Splinter Review