after landing bug 324397, the bm-xserve08 firefox tinderbox (MacOSX Darwin 8.8.4) started timing out during Tp tests. this turned out to be because the Tp measurement had increased 400%. (yes, not 4%, not 40%, but 400%!). all other tinderboxen and platforms were unaffected.
after about five hours of trying various things, we discovered this went away upon commenting out the |if| block here:
which means that the PR_StringToNetAddr call in IsIPAddress is responsible for the increase. this is a repeatable increase, per these Tp test results (search for "LayoutPerformanceTest"):
normal run: http://tinderbox.mozilla.org/showlog.cgi?log=Firefox/1195451340.1195453072.29402.gz
bad run #1: http://tinderbox.mozilla.org/showlog.cgi?log=Firefox/1195459860.1195461911.17154.gz
bad run #2:
note how some of the times in one bad run fluctuate wildly for a given site, e.g. 257 -> 4178, but that this fluctuation is repeated perfectly in the other.
so, the question is, what's causing it? i see three options:
1) there's a problem with bm-xserve08's configuration.
2) there's a bug in that particular release of OSX.
3) there's a bug in the NSPR code. (firefox trunk currently pulls
neither of the other two Darwin 8.8.4 firefox tinderboxen run Tp tests, so it's not immediately obvious if it's 1. 2 and 3 probably amount to the same thing, since the fix would be in NSPR. maybe someone who knows the code has some suggestions, particularly on how we could prove or disprove 1?
So, I can reproduce this on my OS X box.
PR_StringToNetAddr is just a wrapper around pr_StringToNetAddrGAI(), which I timed with gettimeofday()...
The fastest was a "127.0.0.1" string at startup was processed in 53 usec.
The other calls (with "www.mozilla.org") sometimes took ~500usec, sometimes ~100500 usec, and sometimes even longer. Some debug output, with times in usec...
---------> 537 (www.mozilla.org)
---------> 1634 (www.mozilla.org)
---------> 583 (www.mozilla.org)
---------> 380 (www.mozilla.org)
---------> 362 (www.mozilla.org)
---------> 335 (www.mozilla.org)
---------> 730 (www.mozilla.org)
---------> 616 (www.mozilla.org)
---------> 100431 (www.mozilla.org)
---------> 100469 (www.mozilla.org)
---------> 100530 (www.mozilla.org)
---------> 100445 (mozilla.org)
---------> 583 (www.mozilla.org)
---------> 100674 (mozilla.org)
---------> 660 (www.mozilla.org)
---------> 1899391 (mozilla.org)
Notice the odd clustering of times. It almost looks like something is actually trying to do a DNS resolution of the name... pr_StringToNetAddrGAI() is calling getaddrinfo(), which can do DNS resolution, but NSPR is setting hints.ai_flags = AI_NUMERICHOST which should disable that.
Not sure what the problem is just yet.
I remember this performance problem was reported after I checked in the patch
in bug 34843, but I can't find that Bugzilla comment or email.
This is a performance problem of getaddrinfo. We should report it to Apple.
As a workaround, we can use our old function (now named pr_StringToNetAddrFB)
to convert IPv4 addresses. We only need getaddrinfo to convert certain IPv6
addresses (those with scope ids).
i found bug 222031, which disabled getaddrinfo for OSX 10.2 and below (if i'm reading it right). in particular, see bug 222031 comment 11 and bug 222031 comment 15, where Apple apparently say they fixed 10.3 and above to not hit DNS. this may or may not be true, but evidently in 10.4 the problem is back.
wtc, would the fix you suggest be extending the patch in that bug, to cover (at least) 10.4? or just disabling it entirely? i'm wary of using getaddrinfo for any kind of addresses if it does this... mozilla code that uses PR_StringToNetAddr still needs to be performant with ipv6 addresses.
This bug has nothing to do with bug 222031.
getaddrinfo with hints.ai_flags = AI_NUMERICHOST should be
fast because it doesn't need to do DNS resolution.
MacOSX does not do hostname lookups like other OSes:
I wonder what LookupOrder the tinderboxen use. Should help to nail down which of [Cache NI DS DNS FF] are causing the slowness Justin Dolske found.
Looking at the URL above and the output of lookupd -configuration and running tests with different content in /etc/lookupd/hosts should do the trick.
Hope this helps/wasn't obvious to those working on this.
(Sure helped me when I couldn't figure out why my /etc/hosts changes weren't having any effect!)
I'm not sure any of that (comment #5) is relevant. I haven't fiddled with my OS X config, and I doubt the tinderboxen have either. The POSIX spec / RFC 3493 is fairly blunt: "This flag shall prevent any type of name resolution service (for example, the DNS) from being invoked."
OS X is apparently broken in this respect, so we'll have to work around the problem.
I noticed "The implementation of getaddrinfo() is not thread-safe." in the man page, even though the API doesn't look inherently thread-unsafe. But some quick debug checking conforms that although some calls are made from different threads, they're never interleaved.
(In reply to comment #4)
hrm, so - how do we move forward here? i don't really know enough about the guts of this to roll a patch. :/
wtc, do you have any time to make a fix here? or failing that, any ideas on the fix? (i don't quite understand comment 2; are you saying you still want to use getaddrinfo in the ipv6 case, and if so, will ipv6 be subject to poor performance? is there a way to fix this that avoids getaddrinfo altogether, given that it's ignoring AI_NUMERICHOST?)
Created attachment 293128 [details] [diff] [review]
Dan, please review and test this patch. If this patch
fixes the performance problem, it is a strong evidence
that getaddrinfo with AI_NUMERICHOST is slow and we should
report the problem to Apple.
Our own pr_StringToNetAddrFB function can convert IPv4
address strings and most IPv6 address strings. It cannot
convert IPv6 address strings with scope id's. Scope id's
are indicated by a percent sign (%) followed by either a
number or a name, so we can identify such IPv6 address
strings quickly by searching for '%' in the address string.
I'm surprised that AI_NUMERICHOST would be expected to handle scope IDs either, since they have to involve resolution of the scope IDs, no?
AI_NUMERICHOST should handle scope IDs, otherwise we can
just use the deprecated inet_pton function of RFC 2553.
Scope IDs can be represented in two forms:
1. Numbers. For example, "%1".
2. Names. For example, "%en0". I believe these are just the
names of the network interfaces, and therefore converting
these to (integer) scope IDs should not require DNS resolution.
For more info on the textual representation of scope IDs, see
RFC 4007 IPv6 Scoped Address Architecture, Section 11.
With this patch, times on OS X with network.cookie.cookieBehavior set to "1" [as in the bug that spawned this one] seems fine to me. Previously, page loads were very noticeably slower.
With the same debuggin I added as in comment 1, I'm seeing times between 0 and 3 usec. [Faster even than the "normal" ~500 usec calls from before.]
Justin: thanks for testing this patch. Do you know how to
file a Mac OS X bug report or report this problem to someone
I have a login. It would be good if someone running OS X 10.5 filed the report (I'm on 10.4).
It's a closed system; you can't see bugs others have reported.
Matthew: thanks. I submitted a bug report at https://bugreport.apple.com.
The tracking number for my bug report is Bug ID# 5651268, the title is
"getaddrinfo with hints.ai_flags = AI_NUMERICHOST is slow".
Comment on attachment 293128 [details] [diff] [review]
r=dwitte, though maybe amend the comment to cite DNS lookups rather than just being slow? (and maybe this bug# too, if that's NSPR style). i'm also curious how common scope id's are, both overall and relative to ipv6 addresses, in case you happen to know :)
if this fix goes in, do you know when it'll seep through to the 1.9 trunk? if it makes it in for one of the betas (b3 freeze is estimated mid-jan), that'd be ideal.
Apple confirimed that this is a known issue. The original bug number is Bug ID# is 3801604.
Dan, I will add the bug numbers in the comment. Are we sure DNS lookups are being done?
If not, it's better to only say what we know.
I am not familiar with IPv6 scope IDs. I believe that IPv6 scope IDs are only used in th
non-global addresses ("link-local", "site-local", etc.)
Created attachment 293789 [details] [diff] [review]
Proposed patch (as checked in)
I checked in this patch on the NSPR trunk for NSPR 4.7.
The only change I made is to add the bug number.
Checking in prnetdb.c;
/cvsroot/mozilla/nsprpub/pr/src/misc/prnetdb.c,v <-- prnetdb.c
new revision: 3.55; previous revision: 3.54
Created attachment 293794 [details] [diff] [review]
Update the NSPR tag to NSPR_HEAD_20071218.
This new NSPR tag includes the following bug fixes:
1. bug 404399 (this bug)
2. bug 354248 (better logging output, Mac OS X only)
3. bug 400957 (memory leak fix)
4. bug 404272 (define the HAVE_LONG_LONG macro only if it's not yet defined)
5. bug 401768 (Mac OS X 10.5 Leopard build fix)
These bug fixes all have low risk.
wtc, what about bug 405290?
Reed: I omitted bug fixes that modified files or code not being
used by Mozilla clients. Bug 405290 is one of those bugs. It
modified mozilla/nsprpub/config/nspr.m4, which is not used by
the Mozilla clients. The Mozilla clients have its own copy
mozilla/build/autoconf/nspr.m4. For extra assurance I diff'ed
these two nspr.m4 files and verified that they are the same.
Ok, thanks for checking! :)
Comment on attachment 293794 [details] [diff] [review]
Update the NSPR tag to NSPR_HEAD_20071218.
This bug is blocking1.9+, so you don't need approval. :)
I updated the NSPR tag on the Mozilla trunk for Firefox 3 Beta 3.
Checking in client.mk;
/cvsroot/mozilla/client.mk,v <-- client.mk
new revision: 1.359; previous revision: 1.358
Kai, Nathan, Noriko, please note the patch in this bug, which only
affects Mac OS X but is related to IPv6 support. You may want to
apply this patch to your NSPR package.