Closed Bug 239358 Opened 20 years ago Closed 20 years ago

DNS: Reverse lookups are degrading performance

Categories

(Core :: Networking, defect)

x86
Linux
defect
Not set
major

Tracking

()

RESOLVED FIXED
mozilla1.8beta1

People

(Reporter: bugzilla, Assigned: darin.moz)

References

()

Details

(Keywords: fixed-aviary1.0, fixed1.7.5, perf, Whiteboard: [patch] [lin14523])

Attachments

(4 files, 11 obsolete files)

1.18 KB, text/plain
Details
6.67 KB, patch
Details | Diff | Splinter Review
39.24 KB, patch
bryner
: superreview+
chofmann
: approval1.7.5+
Details | Diff | Splinter Review
1.37 KB, patch
Details | Diff | Splinter Review
User-Agent:       Mozilla/5.0 (X11; U; Linux i686; en-US; rv:1.6) Gecko/20040114
Build Identifier: Mozilla/5.0 (X11; U; Linux i686; en-US; rv:1.6) Gecko/20040114

Here's a tcpdump run and some comments to demonstrate the problem

scampbell:~ # tcpdump -n -i eth0 port 53
tcpdump: listening on eth0
09:58:11.094628 10.8.178.10.32785 > 10.8.171.87.53:  47349+ AAAA? www.google.com
. (32) (DF)
09:58:11.119236 10.8.171.87.53 > 10.8.178.10.32785:  47349 1/0/0 CNAME[|domain] (DF)
09:58:11.119537 10.8.178.10.32785 > 10.8.171.87.53:  47350+ A? www.google.com. (
32) (DF)
09:58:11.121283 10.8.171.87.53 > 10.8.178.10.32785:  47350 3/1/0 CNAME[|domain] (DF)
09:58:11.121571 10.8.178.10.32785 > 10.8.171.87.53:  47351+ PTR? 104.167.233.64.
in-addr.arpa. (45) (DF)
        four seconds elapse while we lookup the reverse dns for the google site
        This seems unnecessary and disfunctional
09:58:16.121846 10.8.178.10.32785 > 10.8.171.87.53:  47351+ PTR? 104.167.233.64.
in-addr.arpa. (45) (DF)
09:58:21.123186 10.8.178.10.32785 > 10.8.171.87.53:  47352+ AAAA? www.google.com
. (32) (DF)
09:58:21.147515 10.8.171.87.53 > 10.8.178.10.32785:  47352 1/0/0 CNAME[|domain] (DF)
09:58:21.148001 10.8.178.10.32785 > 10.8.171.87.53:  47353+ PTR? 99.167.233.64.i
n-addr.arpa. (44) (DF)
        We seem to try again for another 5 seconds lost
09:58:26.148848 10.8.178.10.32785 > 10.8.171.87.53:  47353+ PTR? 99.167.233.64.i
n-addr.arpa. (44) (DF)
09:58:31.150124 10.8.178.10.32785 > 10.8.171.87.53:  47354+ AAAA? www.google.com
. (32) (DF)
09:58:31.175291 10.8.171.87.53 > 10.8.178.10.32785:  47354 1/0/0 CNAME[|domain] (DF)
09:58:39.130214 10.8.171.87.53 > 10.8.178.10.32785:  47351 ServFail 0/0/0 (45) (DF)
09:58:39.130239 10.8.171.87.53 > 10.8.178.10.32785:  47351 ServFail 0/0/0 (45) (DF)
09:58:49.131367 10.8.171.87.53 > 10.8.178.10.32785:  47353 ServFail 0/0/0 (44) (DF)
09:58:49.131405 10.8.171.87.53 > 10.8.178.10.32785:  47353 ServFail 0/0/0 (44) (DF)
        Ultimately it took us 38 seconds to open the web page, almost all time
was waiting
        on reverse lookup records that do not seem to serve a usefull purpose.


Reproducible: Always
Steps to Reproduce:
1.Start with a clean DNS cache and new copy of mozilla
2 Go to the google site
3.Watch as things are slow to occur initially, tcpdump is great help

Actual Results:  
It took 38 seconds to open the page.  See the attached and annoted tcpdump report

Expected Results:  
Opened the page immediately, I suspect there is no good reason to do the reverse
lookups that appear to be causing the problem.

I've put this to 'Major' as it will likely cause many bad perceptions about the
excellent Mozilla software.
I have also sent an email to the google folks informing them of their missing
reverse zone. I would think we must expect that such 'bad' dns situations are
fairly common and need to handle them gracefully but google -may- become a bad
test site for this as they may fix the problem on their side soon.
*** Bug 239357 has been marked as a duplicate of this bug. ***
Assignee: general → darin
Component: Browser-General → Networking
QA Contact: general → benc
Why do we do reverse lookups?
As of this morning, 4/6/2004 it appears Google has had a reverse zone (albeit
unpopulated) created, as predicted they are no longer a good test site.
fortunately, www.nytimes.com, my original test case, still provides no reverse
lookup.
I see the reverse lookup the first time I try www.nytimes.com, but if I close
Mozilla and try again, the reverse lookup is not performed, although the forward
lookup is still made.  I'm not running a proxy or any DNS caching (nscd).

tested with linux trunk 2004040408
Andrew's experience seems unique, I have several folks at the local LUG
(www.mdlug.org), as well as myself,  who all can reproduce this on a variety of
linux distros with Mozilla and Firebird.
It's likely that the reverse lookups you're seeing are the result of tcpdump
trying to resolve IP addresses for display.  I don't see them when using tcpdump -n.

I'm going to mark WFM; please re-open if you can reproduce with tcpdump -n.
Status: UNCONFIRMED → RESOLVED
Closed: 20 years ago
Resolution: --- → WORKSFORME
We did use -n on the original tcpdump but here is a new example or mozilla doing
reverse lookups (and failing in this case as www.nytimes.com has no reverse zone).
scampbell:~ # tcpdump -n -i eth0 port 53
tcpdump: listening on eth0
07:55:12.608191 172.17.5.73.32782 > 172.17.4.129.53:  15626+ AAAA?
www.nytimes.com. (33) (DF)
07:55:12.609193 172.17.4.129.53 > 172.17.5.73.32782:  15626 0/1/0 (79)
07:55:12.609372 172.17.5.73.32782 > 172.17.4.129.53:  15627+ AAAA?
www.nytimes.com.seqnt.com. (43) (DF)
07:55:12.610740 172.17.4.129.53 > 172.17.5.73.32782:  15627 NXDomain* 0/1/0 (111)
07:55:12.619182 172.17.5.73.32782 > 172.17.4.129.53:  15628+ A? www.nytimes.com.
(33) (DF)
07:55:12.619382 172.17.4.129.53 > 172.17.5.73.32782:  15628 4/0/0 A
199.239.137.245,[|domain]
07:55:12.619830 172.17.5.73.32782 > 172.17.4.129.53:  15629+ PTR?
245.137.239.199.in-addr.arpa. (46) (DF)
07:55:12.621196 172.17.4.129.53 > 172.17.5.73.32782:  15629 NXDomain 0/1/0 (103)
07:55:12.623749 172.17.5.73.32782 > 172.17.4.129.53:  15630+ AAAA?
www.nytimes.com. (33) (DF)
07:55:12.624360 172.17.4.129.53 > 172.17.5.73.32782:  15630 0/1/0 (79)
07:55:12.630070 172.17.5.73.32782 > 172.17.4.129.53:  15631+ AAAA?
www.nytimes.com.seqnt.com. (43) (DF)
07:55:12.631437 172.17.4.129.53 > 172.17.5.73.32782:  15631 NXDomain* 0/1/0 (111)
07:55:12.631821 172.17.5.73.32782 > 172.17.4.129.53:  15632+ PTR?
200.136.239.199.in-addr.arpa. (46) (DF)
07:55:17.632438 172.17.5.73.32783 > 172.17.4.128.53:  15632+ PTR?
200.136.239.199.in-addr.arpa. (46) (DF)
07:55:22.633391 172.17.5.73.32782 > 172.17.4.129.53:  15632+ PTR?
200.136.239.199.in-addr.arpa. (46) (DF)
07:55:27.634397 172.17.5.73.32783 > 172.17.4.128.53:  15632+ PTR?
200.136.239.199.in-addr.arpa. (46) (DF)
07:55:28.585931 172.17.4.129.53 > 172.17.5.73.32782:  15632 ServFail 0/0/0 (46)
07:55:32.635730 172.17.5.73.32783 > 172.17.4.129.53:  15633+ AAAA?
www.nytimes.com. (33) (DF)
07:55:32.636700 172.17.4.129.53 > 172.17.5.73.32783:  15633 0/1/0 (79)
07:55:32.636924 172.17.5.73.32783 > 172.17.4.129.53:  15634+ AAAA?
www.nytimes.com.seqnt.com. (43) (DF)
07:55:32.638300 172.17.4.129.53 > 172.17.5.73.32783:  15634 NXDomain* 0/1/0 (111)
07:55:32.650682 172.17.5.73.32783 > 172.17.4.129.53:  15635+ PTR?
245.136.239.199.in-addr.arpa. (46) (DF)
07:55:32.967714 172.17.4.128.53 > 172.17.5.73.32783:  15632 ServFail 0/0/0 (46)
07:55:37.651421 172.17.5.73.32784 > 172.17.4.128.53:  15635+ PTR?
245.136.239.199.in-addr.arpa. (46) (DF)
07:55:38.587303 172.17.4.129.53 > 172.17.5.73.32782:  15632 ServFail 0/0/0 (46)
Status: RESOLVED → UNCONFIRMED
Resolution: WORKSFORME → ---
You're right, I misread the original post, sorry.  I still can't reproduce it on
my machine.
Steven: the existence of the reverse lookups in DNS doesn't necessarily mean
that Mozilla is making them.  Do you have a proxy/firewall/etc that might be
doing a lookup?  To see the DNS lookups Mozilla is doing, you can do:

setenv NSPR_LOG_MODULES nsHttpResolver:5
setenv NSPR_LOG_FILE /tmp/dns.log

mozilla http://www.nytimes.com/
(quit once the page is loaded)
I really don't think it's something I'm running, I am not running nsc or named
or any other software that might interfere with DNS.  We (www.mdlug.org) have
witnessed this with many different versions of Mozilla/Firebird on many platform
(it was actually conversations we had amongst ourselves that prompted me to look
for the cause).
Also, I just tried:
scampbell@scampbell:~> NSPR_LOG_MODULES=nsHttpResolver:5
scampbell@scampbell:~> NSPR_LOGFILE=/tmp/dns.log
scampbell@scampbell:~> export NSPR_LOGFILE
scampbell@scampbell:~> export NSPR_LOG_MODULES
scampbell@scampbell:~> mozilla
But this did not generate a /tmp/dns.log file.
Again, my specific example is Suse 9.0 (patched up to date) using Mozilla 1.6 -
Mozilla/5.0 (X11; U; Linux i686; en-US; rv:1.6) Gecko/20040114
Andrew Schultz suggested you set NSPR_LOG_FILE, but you set NSPR_LOGFILE, so it
may be a simple typo (I haven't tried it with either one, just observing the
discrepency)
yes, should be NSPR_LOG_FILE

when not setting NSPR_LOG_FILE but setting NSPR_LOG_MODULES, logging will happen
to stdout (or stderr, not sure)
Thanks, I corrected that but I got an empty log file.  I have asked my local lug
to all try this and report back including tcpdumps of dns, copies of
/etc/resolv.conf and ps-ef.  I'll see what I get back on that.

Sorry for the bugspam folks.
(In reply to comment #12)
> scampbell@scampbell:~> NSPR_LOG_MODULES=nsHttpResolver:5

Hmm, actually looking at the sourcecode, it seems that this is called
nsHostResolver, not nsHttpResolver.
I'm still puzzled about where the reverse lookups come from.

In the past, Mozilla has sent reverse lookups spuriously, and that has been from
using IP addresses. So if you really think mozilla is the problem, trace a URL
that goes to a fixed address (use nslookup to convert and then do http://1.2.3.4).

Use a page that doens't load much from elsewhere.
Summary: Reverse DNS lookups are degrading performance → DNS: Reverse lookups are degrading performance
Testing with a fixed IP there were no reverse lookups generated.  We're seeing
lots of evidence that this is tied to the linux distributions.  Suse 9.0 or
Mandrake 10.0 loaded with Mozilla 1.6 (sorry I don't have specific versions)
demonstrate the problem, however Debian does not.  This is actually watching for
dns traffic using tcpdump. We've eliminated all local nameservers and cachers
from the systems so it's not there. Also, we've noted that some ISPs,
particularily WOW seem to be able to give a negative dns answer quickly masking
the problem. Testing with Konqueror on the same systems does not exhibit the
reverse lookup and the few I've tried (not everyone has tried this).  At this
point it is difficult to say where this problem lies, I'll keep digging for
usuable information/evidence.
There maybe variances in the way DNS resolvers work, either through OS libraries
or vendor configuration.
I see the same problem with firefox 0.8 on my Debian sarge system. Trying to log
DNS lookups to /tmp/dns.log as described also gets me an empty log: what's the
secret to getting this to work?

The problem also shows up if I use another machine running squid as a proxy,
which is strange because requests go to squid as URLs, not IP addresses which
would have to be looked up. Using ethereal on port 3128 of the machine running
firefox doesn't show anything but the URL requests going out. And it takes even
longer to display the page this way.

Yes, I'm running ethereal with name lookups turned off.

 .....Ron
One more thing:

Doing "netstat -anp" while waiting for the page to load shows that the process
doing the DNS lookups is indeed that of firefox.
I think I've found the problem.

File nsprpub/pr/src/misc/prnetdb.c calls getaddrinfo, using (among other things):

> hints.ai_flags = AI_CANONNAME;

   (around line 2085).

According to the getaddrinfo man page,

> ai_canonname is set to point to the official name of the host,
> if  ai_flags  in hints  includes  the AI_CANONNAME flag.

Under some versions of Linux, at least, this seems to cause the resolver library
to do a reverse lookup immediately after doing the forward lookup. This would
happen every time mozilla code calls PR_GetAddrInfoByName(), and indeed the same
problem occurs with mozilla, galeon and epiphany (the only other browsers I've
tried). It's easy to test this: create a simple program that calls getaddrinfo()
on www.nytimes.com and check the response time with and without the AI_CANONNAME
flag. There's still a huge time delay with the flag set, and there's no mozilla
code involved here.

I've done a firefox build with the code setting that flag commented out, and can
confirm that loading nytimes.com is lots faster. Furthermore, monitoring port 53
outgoing on the box with ethereal shows no reverse lookups while that page is
loading.

The remaining problem, of course, is what to do if anything uses the
ai_canonname member, which will now be empty if the flag is not set. A grep
through the firefox code seems to show that the only routine using it is
PR_GetCanonNameFromAddrInfo(), and that in turn doesn't seem to be called by
anything. It'd probably be best to rewrite that routine so it calls
getaddrinfo() itself with the AI_CANONNAME set. 

Other problems? For a start,

- This box dual-boots Linux and Windows 2000. I don't see anything like this
problem on win2k.

- It's not clear why the problem arises on some boxes and not others.

Sigh.

 .....Ron
(In reply to comment #22)

> - It's not clear why the problem arises on some boxes and not others.

If #246361 is the same as this one, it occurs on Debian with the 2.6.3 kernel
but not on Debian with the 2.4.18 kernel. What about the kernel versions on the
boxes where you have seen this bug?
Steven: Can you try to execute
http://mcsmurf.milten.lima-city.de/linux-test/test and
http://mcsmurf.milten.lima-city.de/linux-test/test2 ? The source can be found
under http://mcsmurf.milten.lima-city.de/linux-test/test.c and
http://mcsmurf.milten.lima-city.de/linux-test/test2.c
Please see how long the first program needs to execute and how long the second
one needs. This programs trys to lookup www.nytimes.com, one time with the flag
set (test) and one time without it (test2).
It should be fairly straightforward to avoid resolving the CNAME in most cases.
 There are times when we need it, but those can be special cased.

I'll wait for feedback from mcsmurf's testcases.
I ran the tests.  'test' does the reverse lookup, 'test2' does not. 
 
I'd like to suggest to those who try this out to please run tcpdump on port 53
'tcpdump -n port 53' and watch for 'PTR?' (for example: PTR?
245.137.239.199.in-addr.arpa.).  Remember that the negative cache may still hold
 failed PTR? request which will mask the result when timeing.  Also note that
some ISP's (I note particularily WOW up here in SE Michigan) seems to mask this
by failing PTR? requests quickly.  If we only time it then we will have
unreliable results, we really need to see if the machine trys the PTR? (reverse
lookup) query.

Here's the run times:
> time ./test
real    0m1.652s
user    0m0.003s
sys     0m0.001s

> time ./test2
real    0m0.003s
user    0m0.001s
sys     0m0.000s

> time ./test
real    0m0.333s
user    0m0.003s
sys     0m0.002s

Notice how the second run of ./test is pretty quick, the negative query got
cached.  I'll post this out the our local LUG and see if everyone agrees that
this is probably the culprit.  So I think you may have put your finger on the
problem, I just caution us on our testing procedure so we don't blow right by
the answer.
Status: UNCONFIRMED → ASSIGNED
Ever confirmed: true
Keywords: helpwanted, perf
Target Milestone: --- → mozilla1.8alpha2
(In reply to comment #24)
> Steven: Can you try to execute
> http://mcsmurf.milten.lima-city.de/linux-test/test and
> http://mcsmurf.milten.lima-city.de/linux-test/test2 ? The source can be found
> under http://mcsmurf.milten.lima-city.de/linux-test/test.c and
> http://mcsmurf.milten.lima-city.de/linux-test/test2.c
> Please see how long the first program needs to execute and how long the second
> one needs. This programs trys to lookup www.nytimes.com, one time with the flag
> set (test) and one time without it (test2).

I'm running Arch Linux, kernel 2.6.6 and 

Mozilla 1.6
Mozilla/5.0 (X11; U; Linux i686; en-US; rv:1.6) Gecko/20040117

I ran into this while using a caching DNS on my firewall. Resolving NYT was
taking 20 - 30 sec.  I bypassed my DNS by modifying resolv.conf and went
directly to Comcast's DNS.  This reduced the times to a second, but inspection
with tcpdump shows server failure on PTR lookups, just very quickly.

==== test1 ====

# tcpdump -ni epi1 port 53
tcpdump: listening on ep1
17:16:18.516403 24.61.140.46.9977 > 204.127.202.19.53:  63965+ AAAA?
www.nytimes.com. (33) (DF)
17:16:18.534299 204.127.202.19.53 > 24.61.140.46.9977:  63965 0/1/0 (79) (DF)
17:16:18.534839 24.61.140.46.9977 > 204.127.202.19.53:  63966+ AAAA?
www.nytimes.com. (33) (DF)
17:16:18.551648 204.127.202.19.53 > 24.61.140.46.9977:  63966 0/1/0 (79) (DF)
17:16:18.552138 24.61.140.46.9977 > 204.127.202.19.53:  63967+ A?
www.nytimes.com. (33) (DF)
17:16:18.567887 204.127.202.19.53 > 24.61.140.46.9977:  63967 4/2/2 A
199.239.137.200,[|domain] (DF)
17:16:18.568710 24.61.140.46.9977 > 204.127.202.19.53:  51829+ PTR?
200.137.239.199.in-addr.arpa. (46) (DF)
17:16:18.711439 204.127.202.19.53 > 24.61.140.46.9977:  51829 NXDomain 0/1/0
(103) (DF)
17:16:18.712074 24.61.140.46.9977 > 204.127.202.19.53:  51830+ AAAA?
www.nytimes.com. (33) (DF)
17:16:18.732767 204.127.202.19.53 > 24.61.140.46.9977:  51830 0/1/0 (79) (DF)
17:16:18.733260 24.61.140.46.9977 > 204.127.202.19.53:  51831+ AAAA?
www.nytimes.com. (33) (DF)
17:16:18.750722 204.127.202.19.53 > 24.61.140.46.9977:  51831 0/1/0 (79) (DF)
17:16:18.751324 24.61.140.46.9977 > 204.127.202.19.53:  51832+ PTR?
245.137.239.199.in-addr.arpa. (46) (DF)
17:16:18.766845 204.127.202.19.53 > 24.61.140.46.9977:  51832 NXDomain 0/1/0
(103) (DF)
17:16:18.767459 24.61.140.46.9977 > 204.127.202.19.53:  51833+ AAAA?
www.nytimes.com. (33) (DF)
17:16:18.785336 204.127.202.19.53 > 24.61.140.46.9977:  51833 0/1/0 (79) (DF)
17:16:18.785841 24.61.140.46.9977 > 204.127.202.19.53:  51834+ AAAA?
www.nytimes.com. (33) (DF)
17:16:18.810603 204.127.202.19.53 > 24.61.140.46.9977:  51834 0/1/0 (79) (DF)
17:16:18.811146 24.61.140.46.9977 > 204.127.202.19.53:  51835+ PTR?
200.136.239.199.in-addr.arpa. (46) (DF)
17:16:18.828220 204.127.202.19.53 > 24.61.140.46.9977:  51835 ServFail 0/0/0
(46) (DF)
17:16:18.828943 24.61.140.46.48892 > 216.148.227.79.53:  51835+ PTR?
200.136.239.199.in-addr.arpa. (46) (DF)
17:16:18.909172 216.148.227.79.53 > 24.61.140.46.48892:  51835 ServFail 0/0/0
(46) (DF)
17:16:18.909629 24.61.140.46.9977 > 204.127.202.19.53:  51835+ PTR?
200.136.239.199.in-addr.arpa. (46) (DF)
17:16:18.926686 204.127.202.19.53 > 24.61.140.46.9977:  51835 ServFail 0/0/0
(46) (DF)
17:16:18.927119 24.61.140.46.48892 > 216.148.227.79.53:  51835+ PTR?
200.136.239.199.in-addr.arpa. (46) (DF)
17:16:19.018154 216.148.227.79.53 > 24.61.140.46.48892:  51835 ServFail 0/0/0
(46) (DF)
17:16:19.018790 24.61.140.46.9977 > 204.127.202.19.53:  51836+ AAAA?
www.nytimes.com. (33) (DF)
17:16:19.041374 204.127.202.19.53 > 24.61.140.46.9977:  51836 0/1/0 (79) (DF)
17:16:19.041920 24.61.140.46.9977 > 204.127.202.19.53:  51837+ AAAA?
www.nytimes.com. (33) (DF)
17:16:19.063247 204.127.202.19.53 > 24.61.140.46.9977:  51837 0/1/0 (79) (DF)
17:16:19.063848 24.61.140.46.9977 > 204.127.202.19.53:  51838+ PTR?
245.136.239.199.in-addr.arpa. (46) (DF)
17:16:19.081715 204.127.202.19.53 > 24.61.140.46.9977:  51838 ServFail 0/0/0
(46) (DF)
17:16:19.082147 24.61.140.46.48892 > 216.148.227.79.53:  51838+ PTR?
245.136.239.199.in-addr.arpa. (46) (DF)
17:16:19.162845 216.148.227.79.53 > 24.61.140.46.48892:  51838 ServFail 0/0/0
(46) (DF)
17:16:19.163351 24.61.140.46.9977 > 204.127.202.19.53:  51838+ PTR?
245.136.239.199.in-addr.arpa. (46) (DF)
17:16:19.180410 204.127.202.19.53 > 24.61.140.46.9977:  51838 ServFail 0/0/0
(46) (DF)
17:16:19.180830 24.61.140.46.48892 > 216.148.227.79.53:  51838+ PTR?
245.136.239.199.in-addr.arpa. (46) (DF)
17:16:19.261321 216.148.227.79.53 > 24.61.140.46.48892:  51838 ServFail 0/0/0
(46) (DF)
17:16:19.261892 24.61.140.46.9977 > 204.127.202.19.53:  51839+ AAAA?
www.nytimes.com. (33) (DF)
17:16:24.261723 24.61.140.46.55773 > 216.148.227.79.53:  51839+ AAAA?
www.nytimes.com. (33) (DF)
17:16:24.413212 216.148.227.79.53 > 24.61.140.46.55773:  51839* 0/1/0 (79) (DF)
17:16:24.413794 24.61.140.46.61673 > 204.127.202.19.53:  51840+ AAAA?
www.nytimes.com. (33) (DF)
17:16:24.437986 204.127.202.19.53 > 24.61.140.46.61673:  51840* 0/1/0 (79) (DF)


==== test2====

17:16:42.622400 24.61.140.46.61673 > 204.127.202.19.53:  5550+ AAAA?
www.nytimes.com. (33) (DF)
17:16:42.637652 204.127.202.19.53 > 24.61.140.46.61673:  5550 0/1/0 (79) (DF)
17:16:42.638162 24.61.140.46.61673 > 204.127.202.19.53:  5551+ AAAA?
www.nytimes.com. (33) (DF)
17:16:42.653836 204.127.202.19.53 > 24.61.140.46.61673:  5551 0/1/0 (79) (DF)
17:16:42.654343 24.61.140.46.61673 > 204.127.202.19.53:  5552+ A?
www.nytimes.com. (33) (DF)
17:16:42.670007 204.127.202.19.53 > 24.61.140.46.61673:  5552 4/2/2 A
199.239.136.200,[|domain] (DF)

trader ~$ time ./test1

real    0m0.885s
user    0m0.002s
sys     0m0.002s
trader ~$ time ./test2

real    0m0.057s
user    0m0.001s
sys     0m0.002s
It appears the pertinent code is in /sysdeps/posix/getaddrinfo.c
This is by far the largest source file in .../posix and has been
modified heavily over the last couple of years.  It's hard to follow
how it works and I'd have to use gdb to figure it out :-(

From glibc 2.3.3 BUGS:  (my Arch linux is running 2.3.2):

            List of known bugs (certainly very incomplete)
            ----------------------------------------------

Time-stamp: <02/09/30 13:49:48 drepper>

[...]

[  *]  Some of the functions which also handled IPv6 are currently broken.
       IPv6 and IPv4 lookups occasionally happen when not needed.  This
       happens in getaddrinfo() and getnameinfo().  IPv4 handling of
       these functions is OK though and there are patches available to fix
       the IPv6 code as well.
(In reply to comment #24)
> Steven: Can you try to execute
> http://mcsmurf.milten.lima-city.de/linux-test/test and
> http://mcsmurf.milten.lima-city.de/linux-test/test2 ? The source can be found
> under http://mcsmurf.milten.lima-city.de/linux-test/test.c and
> http://mcsmurf.milten.lima-city.de/linux-test/test2.c
> Please see how long the first program needs to execute and how long the second
> one needs. This programs trys to lookup www.nytimes.com, one time with the flag
> set (test) and one time without it (test2).

hi, 

i ran each test from both inside and outside out firewall, with a peek at
tcpdump.  here's what i found:

all times are user time

inside firewall:
test consistently takes 46s
test2 takes 2s, most of which looks like quad-a lookup

outside the firewall:
test takes 0.52s +- 0.03s
test2 takes 0.12s +- 0.01s
I can reproduce the performance problems on FC2.  It seems that I get 3 DNS
request packets for each hostname that I load when AI_CANONNAME is passed to
getaddrinfo.  It seems that GLIBC issues a PTR request for each of the returned
IP addresses.  I'm not sure why it is doing that since the CNAME is clearly
indicated in the response from the A request.
Interestingly enough, if I call gethostbyname instead of getaddrinfo, I only get
the first A request.  It returns the CNAME, but does not do the reverse lookups.
I don't understand why GLIBC insists on doing reverse lookups when the
AI_CANONNAME flag is passed to getaddrinfo.
I tested WinXP, and it's getaddrinfo implementation does not generate PTR
requests when passed AI_CANONNAME.
WTC suggested that we could extend NSPR's PR_GetAddrInfoByName to support a
PR_AI_NOCANONNAME flag.  We cannot use PR_AI_CANONNAME because we have to keep
binary compatibility with previous NSPR releases that assumed AI_CANONNAME.

However, this seems like a GLIBC bug to me.  Or, maybe there is something that
I'm missing.  I'm going to try to find out more on this from the GLIBC side.
I found some comments on the web suggesting that the specification for
getaddrinfo does not define precisely the meaning of AI_CANONNAME in the context
of DNS.  In some interpretations the CNAME included with a response to an A
query may only be an alias for the real canonical name.  That may explain why
GLIBC is doing reverse lookups.  Sounds to me like implementing WTC's suggestion
is the right way to go.
*** Bug 245174 has been marked as a duplicate of this bug. ***
This does look like dubious behaviour on the part of glibc. When you already
have a reasonable answer (since you got the CNAME) why do a reverse lookup,
which will probably result in lower quality information?
Apparently, the behavior of getaddrinfo(AI_CANONNAME) is quite intentional on
the part of the GLIBC development team.  At least, I was able to get some
information on this from folks at IBM.  The word on the street is: "don't pass
AI_CANONNAME if you don't need the reverse lookups."  So, that's what we're
going to need to do.  

It's a shame that we need to reverse the sense of AI_CANONNAME for NSPR, but I
guess that's our only choice.

WTC also suggested making PR_GetCanonNameFromAddrInfo do another lookup passing
AI_CANONNAME.  That might actually be a good solution for NSPR since we could
then limit the reverse lookup to only one of the IP addresses.

I'm not sure which I prefer at the moment, but I think that Mozilla has almost
no use for AI_CANONNAME.  It is only our webservices security layer that depends
on reverse DNS lookups, and at the moment at least it has no need for reverse
lookups on all IP addresses returned for a hostname.  So, there may in fact be
good reason to implement the reverse lookup inside PR_GetCanonNameFromAddrInfo.

Thoughts?
Attached patch v1 patch (obsolete) — Splinter Review
Patch for NSPR that makes us lazily resolve the canonical hostname when asked
via PR_GetCanonNameFromAddrInfo.  This seems to do the trick.  Under FC2,
resolving www.yahoo.com now results in a single AAAA request followed by a
single A request.  If PR_GetCanonNameFromAddrInfo is called (normally doesn't
happen in Mozilla), then a single PTR request is made for the first IP address
in the addrinfo structure.

I have only tested this patch under Fedora Core 2.

I use getnameinfo to implement the PTR request.  The amount of code is minimal,
and I stash the result in the ai_canonname field of the first addrinfo struct. 
I take care to free this myself before passing the addrinfo struct to
freeaddrinfo.  This solution is better than calling getaddrinfo with
AI_CANONNAME since it avoids doing reverse lookups on all returned IP
addresses.  NSPR's API does not provide the caller with a way to enumerate the
reverse lookups, and I don't think we have a need to supply that API.
Comment on attachment 151050 [details] [diff] [review]
v1 patch

NOTE: with this patch, there is no need for any necko changes.	this is because
necko doesn't call PR_GetCanonNameFromAddrInfo until its consumer calls
GetCanonicalName.  that could be a performance issue because it means that we
could be blocking the UI thread for reverse DNS lookups, but for now i think
that is acceptable.  we only use reverse lookups for the webservices security
model, which is as far as i can tell pretty much unused at the moment.
Attachment #151050 - Flags: review?(wchang0222)
I thought that each host has only one canonical name.
You seem to imply that each IP address of a host has
a canonical name.
(In reply to comment #40)
> I thought that each host has only one canonical name.
> You seem to imply that each IP address of a host has
> a canonical name.

Yeah, I was surprised too.  Try doing a reverse lookup on each of the IP
addresses returned for www.yahoo.com.  You'll find that the result is different
for each IP node!  Even more interesting is the fact that none of the results
for the reverse lookups match the value of h_name returned by gethostbyname!

So, my patch is perhaps suboptimal because we will not return the same value for
PR_GetCanonNameFromAddrInfo in the getaddrinfo and gethostbyname cases.  There
doesn't seem to be a way to make getaddrinfo return the same cname that
gethostbyname returns.
(In reply to comment #37)
> Apparently, the behavior of getaddrinfo(AI_CANONNAME) is quite intentional on
> the part of the GLIBC development team.  At least, I was able to get some
> information on this from folks at IBM.  The word on the street is: "don't pass
> AI_CANONNAME if you don't need the reverse lookups."  So, that's what we're
> going to need to do.  

That doesn't make much sense. RFC 3493 says:

   If nodename is not null, and if requested by the AI_CANONNAME flag,
   the ai_canonname field of the first returned addrinfo structure shall
   point to a null-terminated string containing the canonical name
   corresponding to the input nodename;

so ai_canonname is only defined for the first addrinfo structure. This seems to
suggest that ai_canonname is associated with the host itself and not with one of
its IP addresses.

Therefore, getting ai_canonname through a reverse lookup seems like the wrong
thing to do, since as you observe in comment #41, the reverse DNS entries to
multiple IP addresses from the same forward lookup are often (actually, most of
the time) different.

Thus if you get ai_canonname through a reverse lookup you are going to end up
with a random string (since the IP addresses for the forward lookup are usually
returned in random order by the DNS server) which usually tells you much less
than the simple CNAME would have done in the first place.

Do you know where this is discussed, or can you point me to the checkin that
changed this behaviour? It might be an idea to take this up with the glibc
developers...
Unfortunately, I don't know of any public forum where this was discussed.  Your
interpretation of the RFC makes sense to me, but unfortunately there were
several RFCs covering getaddrinfo, so perhaps GLIBC is implementing an earlier
version???
Grazie, Lorenzo.  That's exactly what I remembered
RFC 3493 says.  When Darin told me that GLIBC does
reverse lookups on all returned IP addresses, I
thought I misinterpreted RFC 3493.  We should ask
the GLIBC implementors of getaddrinfo why they did
not implement the AI_CANONNAME flag in the way RFC
3493 specifies.
I have changed the code in glibc to only fill in ai_canonname for the first
returned entry.  This was something nobody noticed so far even though POSIX
explicitly spells this out.  This does not eliminate the revserse lookups, but
now only one is made.
> I have changed the code in glibc to only fill in ai_canonname for the first
> returned entry.... This does not eliminate the revserse lookups, but
> now only one is made.

ulrich: thanks for making a fix in glibc, but ... i don't understand why even a
single reverse lookup is required.  the response to the A (or AAAA) requests
will include the desired CNAME value.  a reverse lookup on the first IP address
will not return the correct value (or so it seems from the www.yahoo.com testcase).
Ulrich, might it not be a good idea to be a bit lax in interpreting POSIX, given
the current state of the reverse DNS tree in today's Internet?

Reverse lookups quite frequently time out and often the data returned is less
useful than that which can be obtained by simply following the CNAME chain (see
the end of my comment #42). Wouldn't it be better just to do this?

After all, there's getnameinfo() if you want to do reverse lookups.
*** Bug 248585 has been marked as a duplicate of this bug. ***
*** Bug 248828 has been marked as a duplicate of this bug. ***
Target Milestone: mozilla1.8alpha2 → mozilla1.8beta
This LD_PRELOAD library removes the AI_CANONNAME flag from glibc getaddrinfo
(without modifying glibc or Firefox/Mozilla binary). This can be used to test
with the binary distibuted Firefox/Mozilla.
Someone reported on #firefox that his Comcast DNS server won't do reverse
lookups in many cases.  It sounds like a problem with Comcast, but as a result
he was simply unable to browse many websites using Firefox under Fedora Core 2.
 That kind of report makes this bug all the more serious I think.
The only time when reverse lookups are needed is for SSL certificate
verification. It should be possible to move the entire reverse lookup procedure
from the main browser/mail components to nspr and get rid of this problem with
very little ado.  
That is not true. DNS reverse lookups are used everytime a number is displayed
as a name. netstat output. The "connected to" prompt when you telnet or ftp.

As the internet grew rapidly, the DNS tree became very flat and automatically
administered. PTR records became neglected. We have had several bugs in mozilla
that would have been solvable by requiring reverse lookups, but we have found
that too many domains neglect their care.
But doesn't that boil down to the same? If you don't need reverse lookups, or if
you do need them but can't use them because PTR records are neglected, the end
result in both cases is that you shouldn't be doing them. 

I don't think it's only neglected PTR records though. When three zillion domains
and several different services are hosted on the same machine and respond on the
same IP, that one lonely PTR record for the IP becomes rather meaningless, or
even misleading. 
"Proper DNS structure" should be that there is one and only one forward, one and
only one reverse and everything else is CNAMES.   However, it is certainly a
fact that "proper DNS structure" doesn't actually exist on the internet.  I
would think we should avoid using reverse pointers except where absolutely
required so that we meet the reality rather than to aspire to get everyone else
on the planet to come up to our level :)
Keep in mind that we aren't asking the OS to do a PTR request.  We're asking for
the CNAME, which is returned from a forward lookup.  The OS should not need to
do a PTR request to acquire the CNAME for a given host.

Unfortunately, it seems like we are in a position where we either have to live
with the implementation on Linux or we have to hack around it.  Our APIs between
modules within Mozilla require us to provide the CNAME if asked.  We can't
change the NSPR API because it is frozen, so we have to look for a way to
support CNAME queries.

What to do?  Hmm...
Updated to compile on newer gcc.
Attachment #154282 - Attachment is obsolete: true
Has anyone tried to contact the glibc developers on this one? I think strong
arguments can be made for reverting to the previous behaviour.

Fist off, it's a real showstopper for IPv6 adoption. We can hack around it in
Mozilla, and we can hack around it in other apps, but how many apps do we have
to modify?

Also, the change also seems to remove the only way to retrieve a CNAME from the
DNS using the system resolver (as opposed to implementing your own DNS queries,
which is a major pain).

And are there any advantages? If this change was only made to comply with POSIX,
then why not disregard that part of the standard and implement what makes more
sense? It's not as if Linux hasn't done that before...

Could anyone take this up with the glibc developers? I can try posting to
glibc-alpha, but I don't have high hopes that will result in anything...
What previous behavior?  You seem not to understand at all what this is about.

The getaddrinfo implementation is correct.  It is not glibc's fault if the user
uses flags which cause unnecessary additional work.

Simply introduce a way to you the nspr to query without the reverse lookup by
not adding AI_CANONNAME.

There is no need to bring this up on the glibc lists, there won't be any change
since people using this flag actually want the behavior.  Fix nspr and mozilla.
(In reply to comment #59)
> What previous behavior?  You seem not to understand at all what this is about.

Ok, so this is what I understand:

- In previous glibc code AI_CANONNAME caused getaddrinfo to return the name at
the end of the CNAME chain as the canonical name
- In current glibc code AI_CANONNAME causes getaddrinfo to obtain the canonical
name through PTR queries for the IP address returned

Is this correct? If so, the current code has the following disadvantages:

1. It will time out and fail on broken DNS servers. nytimes.com is just one
example. Look at 

http://www.ripe.net/ripe/meetings/ripe-48/presentations/ripe48-dns-malone.pdf

and note the substantial percentage of broken servers that will answer A queries
ONLY (and thus no PTR queries). There is a substantial number of sites out there
which have this problem, which is why we implemented per-domain disabling of
IPv6 lookups in bug 68796.

2. It will probably provide lower quality information. CNAME chains are
typically more accurate than PTR records for the simple reason that the vast
majority of applications uses forward lookups only, and if the CNAME chain
breaks, the apps break. PTR records are often synthetic, generic or even
nonexistent, and nothing really depends on them.

3. It is not clear (to me at least) which IP address the reverse lookup is
performed on if there are multiple addresses for one name (of course there is
only one, since the RFC specifies that ai_canonname is only defined for the
first element of the result list). Is the PTR lookup done on the first IP
address returned? That would mean it is done on one of the host's IPs at random,
given most nameserver implementations. How can this be meaningful?

4. It removes the only way the app had of finding the name at the end of the
CNAME chain (short of implementing its own DNS lookups). Reverse lookups can
always be done with getnameinfo, but getting the name at the end of the CNAME
chain cannot be done using the resolver API any more.

> The getaddrinfo implementation is correct.  It is not glibc's fault if the
> user uses flags which cause unnecessary additional work.

It's correct according to POSIX (the RFC is not clear on this), but might it not
be better to fix these issues rather than be correct according to POSIX?

> Simply introduce a way to you the nspr to query without the reverse lookup by
> not adding AI_CANONNAME.

It's not that we don't want to do this in mozilla... it just seems to me that
the change in glibc's behaviour, although made in the interests of POSIX
compliance, was a change for the worse and thus might be reconsidered. Which of
course would benefit not just mozilla but all apps that use AI_CANONNAME, of
which there are probably a lot of (after all, in all other implementations of
getaddrinfo, including old glibc implementations, AI_CANONNAME was "free").

> There is no need to bring this up on the glibc lists, there won't be any 
> change since people using this flag actually want the behavior.

Are you sure? Has this issue been brought up before? How can PTR lookups be
better than following the CNAME chain?
Don't you realize how incredibly stupid all this is?

"We use this function with a flag.  It does what it is supposed to do but it
takes a while and we actually don't want the extra information.  We could remove
the flag but we don't want this.  So, change the C library to match what we
want.  This will effect all kinds of other programs but this certainly is not as
important as not changing the mozilla sources."


I can explain this only by somebody having been out in the sun for too long. 
Fix the code.  Don't use the flag if you don't want to pay the price to get the
correct behavior.
> Don't you realize how incredibly stupid all this is?

Please don't be silly.  We do need the CNAME in some cases, and the expectation
we had was that the CNAME was given to us for free (as Lorenzo explained).  So,
we chose to ask for it unconditionally.  With gethostbyname and previous
versions of getaddrinfo, it truly was free to ask for the CNAME.  Ok, I accept
that glibc has changed its implementation.  It's unfortunate that binary
compatibility is so poorly regarded, but I can understand your argument about
choosing to implement a standard.  It's unfortunate that other platforms do not
agree with glibc's implementation, but OK... I get it... they are simply behind
the times.

Mozilla will of course change to not request the CNAME.

The distinction between getaddrinfo for forward lookups and getnameinfo for
reverse lookups seemed like a clean distinction to me.  It seems odd to hide the
CNAME value returned from an A or AAAA DNS query, but again... I yield to the
standards argument.  Oh well :-/


> I can explain this only by somebody having been out in the sun for too long. 

Thanks for the warm fuzzies... much appreciated.
Attached patch v2 patch - nspr part (obsolete) — Splinter Review
This patch introduces PR_AI_NOCANONNAME for PR_GetAddrInfoByName.  When this
flag is not given, the code attempts to resolve the CNAME for the given
hostname.  It does not rely on using AI_CANONNAME with the OS's getaddrinfo
because it's clear that platforms do not agree on the meaning of that flag. 
For Mozilla's purposes, we can of adjust the meaning of the CNAME.  This patch
assumes rightly or wrongly that a reverse lookup on the first returned IP
address will be OK for Mozilla on all platforms in all cases.  I'm looking for
consistency first and foremost.

Our CNAME support only applies to the webservices security model and service
principal name creation on Windows for SPNEGO authentication.  I think reverse
lookup on the first returned IP address will be OK for those applications.
Attachment #151050 - Attachment is obsolete: true
Attached patch v2 patch - other part (obsolete) — Splinter Review
Here's the corresponding changes for the rest of the mozilla code base.
I have only tested these patches on Linux (FC2).  They may not compile or work
correctly on other platforms.  I plan to do more testing next week, but help
testing is of course welcome ;-)
Attachment #151050 - Flags: review?(wchang0222)
Lorenzo: WTC says he probably won't be able to look at these patches until the
17th or so since he is on vacation.  If you have time to review them that would
be great!
(In reply to comment #66)
> Lorenzo: WTC says he probably won't be able to look at these patches until the
> 17th or so since he is on vacation.  If you have time to review them that would
> be great!

Sure, but I won't be able to look at them before the 9th or 10th or so when I
get back home...
Attachment #155383 - Flags: review?(lorenzo)
Attachment #155384 - Flags: review?(lorenzo)
Keywords: helpwanted
Whiteboard: [patch]
After consultation with other people in the IPv6 community I decided to take
this up on libc-alpha, because I think glibc's behaviour is incorrect. It has
the disadvantages I list in comment #60, it is arguably incorrect according to
POSIX, and it is done by no other implementation of getaddrinfo that I know of.

See http://sources.redhat.com/ml/libc-alpha/2004-08/msg00080.html
Comment on attachment 155383 [details] [diff] [review]
v2 patch - nspr part

>--- pr/src/misc/prnetdb.c	28 Apr 2004 00:34:07 -0000	3.21.2.22
>+++ pr/src/misc/prnetdb.c	6 Aug 2004 21:00:43 -0000
[...]
>+/*
>+ * This flag prevents getnameinfo from returning numeric addresses.
>+ */
>+#ifndef NI_NAMEREQD
>+#define NI_NAMEREQD 8
>+#endif

Hmm... you call the system implementation of getnameinfo with the NI_NAMEREQD
flag regardless of whether the system headers define it. But if NI_NAMEREQD is
not defined in the system headers, then probably the system implementation of
getnameinfo does not support it, so it's meaningless to use it. So wouldn't it
be better to lose the #ifdef here and just get a compile error if NI_NAMEREQD
is not defined?

>+static char *_pr_strdup(const char *s)
>+{
>+    char *rv;
>+    int len = strlen(s);
[...]

Is this really needed here? :(

> static PRAddrInfo *
> pr_GetAddrInfoByNameFB(const char  *hostname,
>                        PRUint16     af,
>                        PRIntn       flags)
> {
>     PRStatus rv;
>     PRAddrInfoFB *ai;
>-    /* fallback on PR_GetHostByName */
>+
>+    /* fallback to PR_GetHostByName */
>     ai = PR_NEW(PRAddrInfoFB);
>     if (!ai) {
>         PR_SetError(PR_OUT_OF_MEMORY_ERROR, 0);
>         return NULL;
>     }
>+    ai->have_hostent = PR_FALSE;
>+
>+    /*
>+     * test if input is an IP address literal.  if so, then avoid calling
>+     * PR_GetHostByName since on some platforms (Win9x), it cannot handle
>+     * IP address literals.
>+     */
>+    rv = PR_StringToNetAddr(hostname, &ai->addr);
>+    if (rv == PR_SUCCESS) {
>+        if (!(flags & PR_AI_NOCANONNAME)) {
>+            /*
>+             * use PR_GetHostByAddr to fetch CNAME
>+             */
>+            rv = PR_GetHostByAddr(&ai->addr, ai->buf, sizeof ai->buf, &ai->hostent);
>+            if (rv == PR_SUCCESS) {
>+                ai->have_hostent = PR_TRUE;
>+            } else {
>+                /*
>+                 * reverse lookup failures are not fatal.  use the supplied
>+                 * IP address literal instead.
>+                 */
>+                ai->hostent.h_name = _pr_strdup(hostname);
>+                if (!ai->hostent.h_name) {
>+                    PR_Free(ai);
>+                    return NULL;
>+                }
>+            }
>+            ai->have_cname = PR_TRUE;
>+        }
>+        return (PRAddrInfo *) ai;
>+    }
>+
>     rv = PR_GetHostByName(hostname, ai->buf, sizeof ai->buf, &ai->hostent);
>     if (rv == PR_FAILURE) {
>         PR_Free(ai);
>         return NULL;
>     }
>+    ai->have_hostent = PR_TRUE;
>+
>+    if (!(flags & PR_AI_NOCANONNAME)) {
>+        char he_buf[PR_NETDB_BUF_SIZE];
>+        PRHostEnt he;
>+        PRNetAddr addr;
>+
>+        memset(&addr, 0, sizeof(addr));
>+        addr.inet.family = PR_AF_INET;
>+        addr.inet.ip = *(PRUint32 *) ai->hostent.h_addr_list[0];
>+
>+        /*
>+         * use PR_GetHostByAddr to fetch CNAME based on first returned IP
>+         * address.  this is done so that we are consistent with our impl
>+         * that uses getaddrinfo and getnameinfo.
>+         *
>+         * if the reverse lookup fails, then we'll use the given hostname as
>+         * the CNAME disregarding the value of h_name returned from
>+         * PR_GetHostByName.  this is again done to be consistent with our
>+         * impl that uses getaddrinfo and getnameinfo.
>+         */

I don't understand why we need to be consistent here. This is a hack, right? If
you just picked the hostname returned by gethostbyname, the win9x hack above
would go away too.


> PR_IMPLEMENT(PRAddrInfo *) PR_GetAddrInfoByName(const char  *hostname,
>                                                 PRUint16     af,
>                                                 PRIntn       flags)
> {
[...]
>+        /*
>+         * we do not pass AI_CANONNAME at this point because it is known to
>+         * result in reverse lookups for each returned IP address on some
>+         * platforms.  instead, we'll perform an explicit reverse lookup on the
>+         * first returned IP address to resolve the CNAME if requested.
>+         *
>+         * unfortunately, we are constrained by the fact that AI_CANONNAME is
>+         * not implemented uniformly across platforms.  see bug 239358 for
>+         * further info regarding this implementation.
>          */
> 
>         memset(&hints, 0, sizeof(hints));
>-        hints.ai_flags = AI_CANONNAME;
>+        hints.ai_flags = 0;
[...]
>         rv = GETADDRINFO(hostname, NULL, &hints, &res);
>-        if (rv == 0)
>+        if (rv == 0) {
>+            if (!(flags & PR_AI_NOCANONNAME)) {
>+                /*
>+                 * do a reverse lookup on the first returned IP address to
>+                 * determine the CNAME for this host.
>+                 * 
>+                 * NOTE: this may give bogus results in some cases.
>+                 */

Hmm... if I understand correctly, this means that we adopt the questionable
glibc behaviour of doing PTR requests on all platforms. Wouldn't it be better
to wrap this in an #ifdef GLIBC or something so at least other platforms don't
take the hit?

> PR_IMPLEMENT(void) PR_FreeAddrInfo(PRAddrInfo *ai)
> {
[...]
>+#if !defined(_PR_HAVE_GETADDRINFO) || defined(_PR_INET6_PROBE)
>+    PRAddrInfoFB *fb = (PRAddrInfoFB *) ai;
>+    /*
>+     * h_name field is malloc'd separately if not pointing to an address
>+     * within buf.
>+     */
>+    if (fb->have_cname &&
>+        (fb->hostent.h_name < fb->buf ||
>+         fb->hostent.h_name >= (fb->buf + sizeof fb->buf))) {
>+        PR_Free(fb->hostent.h_name);

1. Wow, it would never have occurred to me that there might be a leak there!
2. Bah, how is it that "sizeof fb->buf" compiles but "sizeof char" doesn't
compile?!? :)
thanks for looking over the patch...

from a security point of view (think our web services security model), it's
important that we have a consistent way of resolving CNAMEs on all platforms. 
that said, i'm not crazy about this patch.  it is a real shame to even consider
this much extra code for something the OS should provide.  but, given that there
is no consistency in the implementation of getaddrinfo(AI_CANONNAME), what are
we to do other than what i have proposed here?  today's web services security
model is somewhat broken (i.e., not consistent across all platforms) as a result
of GLIBC's new behavior, and that's not acceptable for Mozilla.
See also this bug in glibc bugzilla:

http://sources.redhat.com/bugzilla/show_bug.cgi?id=304

which reports the same issue and contains a patch (but not written by the glibc
developers).
This seems to be fixed in glibc CVS: the source I pulled from CVS head today
does not do PTR lookups, but if I revert to getaddrinfo.c v1.63 (5 august) it does.

So maybe we should put these patches on hold...
> So maybe we should put these patches on hold...

No.  It is wrong to look for the canonical name if it is not needed.  There is
always a cost associated with the lookup.  Just fix that damned function and
punish the original author.
Comment on attachment 155383 [details] [diff] [review]
v2 patch - nspr part

ok, let's not do it this way.  let's assume that the OS returns the right thing
when AI_CANONNAME is passed.

this may break some things like the webservices security model, but at least
that breakage will be confined to a relatively small set of operating systems.
Attachment #155383 - Attachment is obsolete: true
Attachment #155383 - Flags: review?(lorenzo)
Attachment #155384 - Flags: review?(lorenzo)
Comment on attachment 155384 [details] [diff] [review]
v2 patch - other part

This part of the v2 patch is still good.  We still want PR_AI_NOCANONNAME, but
we should implement it differently (i.e., in terms of
getaddrinfo(AI_CANONNAME)).

I won't have a chance to work on that patch until next week.
Attached patch v3 patch (nspr part) (obsolete) — Splinter Review
Darin, did you mean something like this?
Attached patch v3 patch (rest of mozilla) (obsolete) — Splinter Review
Darin, this is your v2 patch with a few small changes.

I still have a nagging doubt though: what if someone starts a lookup with
AI_NOCANONNAME and then someone else (web services?) starts a new lookup for
the same host without AI_NOCANONNAME while the first one is still in progress?

If I understand the code right, it will get back a result without a CNAME,
which is not what it wants...
Comment on attachment 156347 [details] [diff] [review]
v3 patch (rest of mozilla)

As can be seen using the handy diff-between-patches feature:

http://bugzilla.mozilla.org/attachment.cgi?oldid=155384&action=interdiff&newid=
156347&headers=1

I only slightly changed the numeric address handling and a couple of comments.
Comment on attachment 156346 [details] [diff] [review]
v3 patch (nspr part)

wtc, could you look at this?
Attachment #156346 - Flags: review?(wchang0222)
Same patch as before, minus typos.
Attachment #156347 - Attachment is obsolete: true
Comment on attachment 156346 [details] [diff] [review]
v3 patch (nspr part)

Sorry about the late reply. I just came back from
a three week vacation.

>+**  PRIntn flags        Must include PR_AI_ADDRCONFIG.  PR_AI_NOCANONNAME
>+**                      can be passed to suppress CNAME resolution.

Please change "CNAME resolution" to "the determination
of the canonical name corresponding to hostname".  The
reason is that "CNAME" is specific to DNS.  My suggested
wording comes from RFC 3493.

It seems that the comment for 'flags' should say "Must
be either PR_AI_ADDRCONFIG or PR_AI_ADDRCONFIG|PR_AI_NOCANONNAME.
Include PR_AI_NOCANONNAME to suppress the determination
of the canonical name corresponding to hostname."
I have a related comment below.

> typedef struct PRAddrInfo PRAddrInfo;
> 
>+#define PR_AI_NOCANONNAME 0x8000
>+
> NSPR_API(PRAddrInfo*) PR_GetAddrInfoByName(
>     const char *hostname, PRUint16 af, PRIntn flags);

I'm wondering if we should put the definition of
PR_AI_NOCANONNAME next to the definitions of the
other PR_AI_XXX macros.

>     /* restrict input to supported values */
>-    if ((af != PR_AF_INET && af != PR_AF_UNSPEC) || flags != PR_AI_ADDRCONFIG) {
>+    if ((af != PR_AF_INET && af != PR_AF_UNSPEC) || !(flags & PR_AI_ADDRCONFIG) ) {
>         PR_SetError(PR_INVALID_ARGUMENT_ERROR, 0);
>         return NULL;
>     }

It seems that the right change is to allow only two
possible values of flags:
PR_AI_ADDRCONFIG
PR_AI_ADDRCONFIG|PR_AI_NOCANONNAME

The (flags & PR_AI_ADDRCONFIG) test allows more than
those two values

> #if defined(_PR_HAVE_GETADDRINFO)
> #if defined(_PR_INET6_PROBE)
>     if (!_pr_ipv6_is_present)
>-        return ((const PRAddrInfoFB *) ai)->hostent.h_name;
> #endif
>-    return ((const PRADDRINFO *) ai)->ai_canonname;
>-#else
>-    return ((const PRAddrInfoFB *) ai)->hostent.h_name;
>+        {
>+            return ((const PRADDRINFO *) ai)->ai_canonname;
>+        }
>+#endif
>+
>+#if !defined(_PR_HAVE_GETADDRINFO) || defined(_PR_INET6_PROBE)
>+    const PRAddrInfoFB *fb = (const PRAddrInfoFB *) ai;
>+    return fb->have_cname ? fb->hostent.h_name : NULL;
> #endif
> }

I prefer to see you keep the original #if structure.
I want to avoid nesting #if with if, which is very
hard to read.

Also, the above code is not legal C code because the
declaration of local variable 'fb' cannot be in the
middle of the function.
Attachment #156346 - Attachment is obsolete: true
Same patch as above, but fixes comment of flags parameter to
PR_GetAddrInfoByName.
Attachment #156502 - Attachment is obsolete: true
Attached patch yet another nspr patch (obsolete) — Splinter Review
Sorry for all the confusion. Please ignore the other patches I just posted. :)
Attachment #156503 - Attachment is obsolete: true
Attachment #156346 - Flags: review?(wchang0222)
Comment on attachment 156505 [details] [diff] [review]
yet another nspr patch

wtc, is this better?
Attachment #156505 - Flags: review?(wchang0222)
Comment on attachment 156505 [details] [diff] [review]
yet another nspr patch

r=wtc.

> PR_IMPLEMENT(const char *) PR_GetCanonNameFromAddrInfo(const PRAddrInfo *ai)
> {
>+    const PRAddrInfoFB *fb = (const PRAddrInfoFB *) ai;
>+
> #if defined(_PR_HAVE_GETADDRINFO)
> #if defined(_PR_INET6_PROBE)
>     if (!_pr_ipv6_is_present)
>-        return ((const PRAddrInfoFB *) ai)->hostent.h_name;
>+        return fb->has_cname ? fb->hostent.h_name : NULL;
> #endif
>     return ((const PRADDRINFO *) ai)->ai_canonname;
> #else
>-    return ((const PRAddrInfoFB *) ai)->hostent.h_name;
>+    return fb->has_cname ? fb->hostent.h_name : NULL;
> #endif
> }

This will cause an unused local variable compiler warning
in the case that doesn't use 'fb'.
Attachment #156505 - Flags: superreview?(darin)
Attachment #156505 - Flags: review?(wchang0222)
Attachment #156505 - Flags: review+
Comment on attachment 156505 [details] [diff] [review]
yet another nspr patch

>Index: nsprpub/pr/include/prnetdb.h

> #define PR_AI_ALL        0x08
> #define PR_AI_V4MAPPED   0x10
> #define PR_AI_ADDRCONFIG 0x20
>+#define PR_AI_NOCANONNAME 0x8000
> #define PR_AI_DEFAULT    (PR_AI_V4MAPPED | PR_AI_ADDRCONFIG)

might be nice to make all the values line up in a column.


>+    ai->has_cname = (flags & PR_AI_NOCANONNAME) ? PR_FALSE : PR_TRUE;

just use a logical not operator?  or maybe the compiler would
generate the same code when optimized.

  ai->has_cname = !(flags & PR_AI_NOCANONNAME);


sr=darin
Attachment #156505 - Flags: superreview?(darin) → superreview+
Attached patch final NSPR patchSplinter Review
Here's the final patch for NSPR that I landed on the trunk and
NSPRPUB_PRE_4_2_CLIENT_BRANCH.
Attachment #156505 - Attachment is obsolete: true
(In reply to comment #77)
> I still have a nagging doubt though: what if someone starts a lookup with
> AI_NOCANONNAME and then someone else (web services?) starts a new lookup for
> the same host without AI_NOCANONNAME while the first one is still in progress?
> 
> If I understand the code right, it will get back a result without a CNAME,
> which is not what it wants...

you are correct.  this is a serious bug in the current patch.  i'm looking at
the code now to see about fixing that.
lorenzo: i think the solution is to use a key for the dns cache that is composed
of the hostname being queried, the presence of the RES_CANON_NAME flag, and
probably also the address family being queried.
Attached patch v4 patch (rest of mozilla) (obsolete) — Splinter Review
This patch implements my proposal to use a hash key that is composed of the
hostname, flags, and af parameters passed to ResolveHost.
Attachment #155384 - Attachment is obsolete: true
Attachment #156370 - Attachment is obsolete: true
Attachment #157588 - Flags: review?(lorenzo)
Comment on attachment 157588 [details] [diff] [review]
v4 patch (rest of mozilla)

A couple of nits:

>Index: extensions/negotiateauth/nsNegotiateAuthSSPI.cpp
>===================================================================
>+    rv = dns->Resolve(Substring(buf, index + 1),
>+                      nsIDNSService::RESOLVE_CANONICAL_NAME,

Does this need CNAME resolution?
(I know nothing about negotiateauth, so this is almost certainly right, but
still)

>Index: netwerk/dns/public/nsIDNSService.idl
>===================================================================
>@@ -60,47 +60,64 @@ interface nsIDNSService : nsISupports
>[...]
>     nsIDNSRequest asyncResolve(in AUTF8String    aHostName,
>-                               in boolean        aBypassCache,
>+                               in unsigned long  aFlags,
>                                in nsIDNSListener aListener,
>                                in nsIEventQueue  aListenerEventQ);
> 
>     /**
>-     * called to synchronously resolve a hostname.
>+     * called to synchronously resolve a hostname.  warning this method may
>+     * block the calling thread for a length period of time.  it is extremely

s/length/long

>Index: netwerk/dns/src/nsDNSService2.cpp
>===================================================================
>@@ -1,10 +1,11 @@
>+/* vim:set ts=4 sw=4 sts=4 et cin: */

>Index: netwerk/dns/src/nsHostResolver.cpp
>===================================================================
>@@ -1,10 +1,11 @@
>+/* vim:set ts=4 sw=4 sts=4 et cin: */

Should these be here?

> [...]
>+// this macro filters out any flags that are not used when constructing
>+// the host key.  the significant flags are those that would affect the
>+// result of the host resolution.
>+#define RES_KEY_FLAGS(_f) ((_f) & nsHostResolver::RES_CANON_NAME)

Hmm... it's not really that the flags influence host resolution, since the
bypass cache flag also affects it. Maybe you could say "the significant flags
are those that would be passed to the system resolver"? Although that's not
very good either.

>Index: netwerk/dns/src/nsHostResolver.h
>===================================================================
>+/* vim:set ts=4 sw=4 sts=4 et cin: */

Ditto

Aside from that, looks good! r=lorenzo
Attachment #157588 - Flags: review?(lorenzo) → review+
> Does this need CNAME resolution?
> (I know nothing about negotiateauth, so this is almost certainly right, but
> still)

Yeah, under windows the service principal name is formed from the canonical
hostname.  At least, this is what IE does, so we try to do the same.


> >+/* vim:set ts=4 sw=4 sts=4 et cin: */

Yeah, I like to add vim control lines so that folks don't have to change their
global editor settings as they jump around to different files in the mozilla
source.  You'll see similar stuff in other files around the tree.


> Hmm... it's not really that the flags influence host resolution, since the

yeah, good point.  i'll come up with something better.
final patch
Attachment #157588 - Attachment is obsolete: true
Attachment #157627 - Flags: superreview?(bryner)
Attachment #157627 - Flags: superreview?(bryner) → superreview+
fixed-on-trunk for mozilla 1.8a4
Status: ASSIGNED → RESOLVED
Closed: 20 years ago20 years ago
Resolution: --- → FIXED
Comment on attachment 157627 [details] [diff] [review]
v4.1 patch (rest of mozilla)

this patch solves a significant performance problem on newer linux distros. 
while the patch is large and involved, it might be worth considering for the
aviary branch at least, and maybe the 1.7 branch as well.
Attachment #157627 - Flags: approval1.7.x?
Attachment #157627 - Flags: approval-aviary?
Comment on attachment 157627 [details] [diff] [review]
v4.1 patch (rest of mozilla)

after some discussion a=chofmann,brendan for the branches
Attachment #157627 - Flags: approval1.7.x?
Attachment #157627 - Flags: approval1.7.x+
Attachment #157627 - Flags: approval-aviary?
Attachment #157627 - Flags: approval-aviary+
fixed-aviary1.0
Keywords: fixed-aviary1.0
fixed1.7.x
Keywords: fixed1.7.x
I just got a
c:/work/mozilla_source/1.7/mozilla/netwerk/dns/src/nsHostResolver.cpp(635) :
error C2065: 'PR_AI_NOCANONNAME' : undeclared identifier building on WinXp. Do I
need to clean and rebuild or is this a problem?
bc: make sure you update and rebuild nsprpub.  you shouldn't have to clobber
anything.  all of the tinderboxen cleared.
Whiteboard: [patch] → [patch] [lin14523]
You need to log in before you can comment on or make changes to this bug.

Attachment

General

Creator:
Created:
Updated:
Size: