Closed Bug 193827 Opened 22 years ago Closed 20 years ago

DNS sometimes hangs in TCP mode

Categories: Core :: Networking, defect
Platform: x86, Linux
Priority: Not set
Severity: normal
Status: VERIFIED WORKSFORME
People: Reporter: mal, Assigned: gordon

User-Agent:       Mozilla/5.0 (X11; U; Linux i686; en-US; rv:1.0.1) Gecko/20021003
Build Identifier: Mozilla/5.0 (X11; U; Linux i686; en-US; rv:1.0.1) Gecko/20021003

Once in a while the DNS resolver in mozilla-1.0.1-2.7.3 (RedHat 7.3)
hangs. It prints "resolving host.name.com" in the status line
and cannot open the site.
Sites that are already open work OK, but new sites cannot be resolved.
Exiting mozilla does not help, because some
mozilla processes keep running (see below).
The only way to fix this is to run
killall mozilla-bin
(sometimes this needs to be done several times).
Only after this can a new instance of mozilla be started.
This is very annoying.
The problem seems to have existed since the time of Netscape 3
and in all Mozilla versions.
I have had this (very annoying) problem on a number of
Linux/Solaris versions with a variety of mozilla versions.

P.S.
When DNS hangs, these
are the processes still running in the background after I exit mozilla.

ps axuww|grep mozilla
mal      14281  0.2 17.0 63140 43712 ?       S    Feb17   2:02
/usr/lib/mozilla/mozilla-bin
mal      14287  0.0 17.0 63140 43712 ?       S    Feb17   0:00
/usr/lib/mozilla/mozilla-bin
mal      14289  0.0 17.0 63140 43712 ?       S    Feb17   0:00
/usr/lib/mozilla/mozilla-bin
mal      14291  0.0 17.0 63140 43712 ?       S    Feb17   0:01
/usr/lib/mozilla/mozilla-bin



Reproducible: Sometimes

Steps to Reproduce:
This happens on average once a week.
Note that when I run
dig host.name.com
on the command line I get a proper response.
DNS itself is working.
This is a problem with the mozilla DNS resolver.


Also, as pointed out in
http://bugzilla.mozilla.org/show_bug.cgi?id=188332

the site http://story.news.yahoo.com/
always causes this problem.

But the command
# host story.news.yahoo.com
shows correct DNS resolution:

story.news.yahoo.com is an alias for dailynews.yahoo.com.
dailynews.yahoo.com is an alias for dailynews.yahoo.akadns.net.
dailynews.yahoo.akadns.net has address 64.58.76.117
Assignee: asa → dougt
Component: Browser-General → Networking
QA Contact: asa → benc
-> invalid because the build is too old and you use a non-mozilla.org build.
Status: UNCONFIRMED → RESOLVED
Closed: 22 years ago
Resolution: --- → INVALID
It is not that old (Oct 2002), just three months old.
And I bet the same error is also present in new builds.
You should test and find the cause of this very annoying bug
(people often need to reboot their computers because of it)
rather than mark a valid and well-described report as invalid.
Vladislav, if you can reproduce this bug on a new, Mozilla.org build then please
reopen this report.

For a bug to be valid, it must be reproducible on a build < 1 month old.
Can you check the web site

http://story.news.yahoo.com/

from your new browser? It often (but not always) hangs on mine.
This URL does not hang for me using 20030210 on WinXP.
Exactly the same DNS problem exists with

Mozilla/5.0 (X11; U; Linux i686; en-US; rv:1.3b) Gecko/20030210


I just clicked on
http://story.news.yahoo.com/
and got the same thing: "resolving story.news.yahoo.com", and a few
processes remain in the background even after I exited mozilla.
Status: RESOLVED → UNCONFIRMED
Resolution: INVALID → ---
Also note that
http://story.news.yahoo.com/
sometimes works. Same story as I mentioned earlier.
If this is important:
my Linux box (on which mozilla has this problem)
has a 192.168.x.x IP;
it is behind another Linux box (masquerading).
The DNS servers are

nameserver 194.8.160.90
nameserver 195.131.52.130

DNS is working OK.
host story.news.yahoo.com
always shows the right IP:
story.news.yahoo.com is an alias for dailynews.yahoo.com.
dailynews.yahoo.com is an alias for dailynews.yahoo.akadns.net.
dailynews.yahoo.akadns.net has address 64.58.76.117

The thing you may be interested in is:
dig sometimes gets DNS response in TCP mode
for this specific host: story.news.yahoo.com

-----------
 dig story.news.yahoo.com A
;; Truncated, retrying in TCP mode.
.......

I am not sure this is related; it seems not.
Also, it seems the mozilla DNS resolver does not have
its own timeout set right.



Also, this may be related
(but note that the mozilla DNS resolver timeout is wrong anyway).

From time to time I get this with dig:
-------------------------------------------
$ dig story.news.yahoo.com a
;; Truncated, retrying in TCP mode.

; <<>> DiG 9.2.1 <<>> story.news.yahoo.com a
;; global options:  printcmd
;; connection timed out; no servers could be reached
------------------------------------------------
There is no such problem with other hosts.
I bet this is somehow related to the timeout when
the DNS request is made in TCP mode.

(You can imitate it with this:)
dig +tcp story.news.yahoo.com a

The mozilla DNS resolver does not seem to work right in TCP mode.
(Also note that the timeouts are wrong.)
This is what happens when mozilla DNS hangs:
Mozilla/5.0 (X11; U; Linux i686; en-US; rv:1.3b) Gecko/20030210


1. It tries to get DNS via UDP; this somehow fails.
2. Then it keeps trying to reach DNS via TCP,
    gets no reply, but continues to try.
    This seems to be the same error as dig in the example above,
    but dig has the right timeout set and mozilla does not.
    As a result the resolver just hangs.

This problem exists in all mozilla & netscape versions I tried.


598 PROTO=UDP SPT=53 DPT=32794 LEN=158 
Feb 19 13:58:05 hnx kernel: IN= OUT=eth0 SRC=192.168.3.97 DST=194.8.160.90
LEN=66 TOS=0x00 PREC=0x00 TTL=64 ID=4660 DF PROTO=UDP SPT=32794 DPT=53 LEN=46 
Feb 19 13:58:05 hnx kernel: IN=eth0 OUT=
MAC=00:d0:b7:07:7b:f1:00:d0:b7:07:7b:f0:08:00 SRC=194.8.160.90 DST=192.168.3.97
LEN=536 TOS=0x00 PREC=0x00 TTL=61 ID=42793 PROTO=UDP SPT=53 DPT=32794 LEN=516 
Feb 19 13:58:05 hnx kernel: IN= OUT=eth0 SRC=192.168.3.97 DST=194.8.160.90
LEN=60 TOS=0x00 PREC=0x00 TTL=64 ID=16898 DF PROTO=TCP SPT=47409 DPT=53
WINDOW=5840 RES=0x00 SYN URGP=0 
Feb 19 13:58:08 hnx kernel: IN= OUT=eth0 SRC=192.168.3.97 DST=194.8.160.90
LEN=60 TOS=0x00 PREC=0x00 TTL=64 ID=39860 DF PROTO=TCP SPT=47409 DPT=53
WINDOW=5840 RES=0x00 SYN URGP=0 
Feb 19 13:58:14 hnx kernel: IN= OUT=eth0 SRC=192.168.3.97 DST=194.8.160.90
LEN=60 TOS=0x00 PREC=0x00 TTL=64 ID=39861 DF PROTO=TCP SPT=47409 DPT=53
WINDOW=5840 RES=0x00 SYN URGP=0 
Feb 19 13:58:26 hnx kernel: IN= OUT=eth0 SRC=192.168.3.97 DST=194.8.160.90
LEN=60 TOS=0x00 PREC=0x00 TTL=64 ID=39862 DF PROTO=TCP SPT=47409 DPT=53
WINDOW=5840 RES=0x00 SYN URGP=0 
Feb 19 13:58:50 hnx kernel: IN= OUT=eth0 SRC=192.168.3.97 DST=194.8.160.90
LEN=60 TOS=0x00 PREC=0x00 TTL=64 ID=39863 DF PROTO=TCP SPT=47409 DPT=53
WINDOW=5840 RES=0x00 SYN URGP=0 
Feb 19 13:59:38 hnx kernel: IN= OUT=eth0 SRC=192.168.3.97 DST=194.8.160.90
LEN=60 TOS=0x00 PREC=0x00 TTL=64 ID=39864 DF PROTO=TCP SPT=47409 DPT=53
WINDOW=5840 RES=0x00 SYN URGP=0 
Feb 19 14:01:14 hnx kernel: IN= OUT=eth0 SRC=192.168.3.97 DST=195.131.52.130
LEN=60 TOS=0x00 PREC=0x00 TTL=64 ID=39865 DF PROTO=TCP SPT=47410 DPT=53
WINDOW=5840 RES=0x00 SYN URGP=0 
Feb 19 14:01:17 hnx kernel: IN= OUT=eth0 SRC=192.168.3.97 DST=195.131.52.130
LEN=60 TOS=0x00 PREC=0x00 TTL=64 ID=10686 DF PROTO=TCP SPT=47410 DPT=53
WINDOW=5840 RES=0x00 SYN URGP=0 
Feb 19 14:01:23 hnx kernel: IN= OUT=eth0 SRC=192.168.3.97 DST=195.131.52.130
LEN=60 TOS=0x00 PREC=0x00 TTL=64 ID=10687 DF PROTO=TCP SPT=47410 DPT=53
WINDOW=5840 RES=0x00 SYN URGP=0 
Feb 19 14:01:35 hnx kernel: IN= OUT=eth0 SRC=192.168.3.97 DST=195.131.52.130
LEN=60 TOS=0x00 PREC=0x00 TTL=64 ID=10688 DF PROTO=TCP SPT=47410 DPT=53
WINDOW=5840 RES=0x00 SYN URGP=0
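
A note on reading the log above: the spacing of the TCP SYN packets can be checked with a few lines of Python. This is only a sketch that copies the timestamps from the log; it assumes the SYN at 13:58:05 is the first connection attempt to 194.8.160.90. The gaps double each time, which is consistent with the kernel retransmitting an unanswered SYN with exponential backoff, and the resolver moves to the second nameserver only about three minutes later.

```python
from datetime import datetime

# Timestamps of the TCP SYN packets sent to 194.8.160.90 (copied from the log).
SYN_TIMES = ["13:58:05", "13:58:08", "13:58:14", "13:58:26", "13:58:50", "13:59:38"]
# First SYN sent to the second nameserver, 195.131.52.130.
NEXT_SERVER_SYN = "14:01:14"


def gaps_in_seconds(stamps):
    """Return the gaps, in whole seconds, between consecutive HH:MM:SS stamps."""
    times = [datetime.strptime(s, "%H:%M:%S") for s in stamps]
    return [int((b - a).total_seconds()) for a, b in zip(times, times[1:])]


gaps = gaps_in_seconds(SYN_TIMES)
total = gaps_in_seconds([SYN_TIMES[0], NEXT_SERVER_SYN])[0]
print(gaps)   # [3, 6, 12, 24, 48]
print(total)  # 189
```

So a single blocked TCP lookup ties up the resolver for roughly 189 seconds per nameserver when nothing bounds the connect from the application side.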

I thought TCP mode was only used for:

1- named-xfer
2- responses that do not fit in a single UDP datagram.

Is there any indication why TCP is used? A client should probably never end up
in TCP mode for DNS.

Flags: blocking1.4a?
Hardware: Other → PC
Summary: DNS sometimes hangs. → DNS sometimes hangs in TCP mode
As far as I know a client may use TCP mode,
and the DNS resolvers which come with glibc (and bind-utils)
sometimes use DNS in TCP mode.

It definitely happens when a
response does not fit in a single UDP datagram.

See
http://colalug.org/members/resources/IP-Chains/HOWTO-5.html
--------------------
5.2 What not to filter out. 
TCP connections to DNS (nameservers).

If you're trying to block outgoing TCP connections, remember that DNS doesn't
always use UDP; if the reply from the server exceeds 512 bytes, the client uses
a TCP connection (still going to port number 53) to get the data.

This can be a trap because DNS will `mostly work' if you disallow such TCP
transfers; you may experience strange long delays and other occasional DNS
problems if you do. 
--------------------
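
The UDP-then-TCP behaviour quoted above can be sketched in Python as follows. This is a minimal illustration, not what mozilla or glibc actually do: `build_query`, `is_truncated`, `recv_exact` and `lookup` are hypothetical helper names, and the 5 second timeout is an arbitrary assumption. The point is that the TCP retry is bounded by an explicit timeout instead of waiting for the kernel to give up.

```python
import socket
import struct


def build_query(name, qtype=1, query_id=0x1234):
    """Build a minimal DNS query packet (RFC 1035) with one question."""
    # Header: id, flags (RD set), 1 question, 0 answer/authority/additional.
    header = struct.pack("!HHHHHH", query_id, 0x0100, 1, 0, 0, 0)
    qname = b"".join(
        bytes([len(label)]) + label.encode("ascii")
        for label in name.rstrip(".").split(".")
    ) + b"\x00"
    return header + qname + struct.pack("!HH", qtype, 1)  # class IN


def is_truncated(message):
    """Return True if the TC (truncation) bit is set in the header flags."""
    flags = struct.unpack("!H", message[2:4])[0]
    return bool(flags & 0x0200)


def recv_exact(sock, count):
    """Read exactly `count` bytes from a stream socket."""
    buf = b""
    while len(buf) < count:
        chunk = sock.recv(count - len(buf))
        if not chunk:
            raise ConnectionError("connection closed early")
        buf += chunk
    return buf


def lookup(name, server, timeout=5.0):
    """Query over UDP; if the reply is truncated, retry over TCP.

    Both steps raise a timeout error after `timeout` seconds.
    """
    query = build_query(name)
    with socket.socket(socket.AF_INET, socket.SOCK_DGRAM) as udp:
        udp.settimeout(timeout)
        udp.sendto(query, (server, 53))
        response, _ = udp.recvfrom(512)
    if not is_truncated(response):
        return response
    # TCP fallback: create_connection() gives up after `timeout` seconds
    # rather than blocking until the kernel stops retransmitting SYNs.
    with socket.create_connection((server, 53), timeout=timeout) as tcp:
        # DNS over TCP prefixes each message with a 2-byte length.
        tcp.sendall(struct.pack("!H", len(query)) + query)
        (length,) = struct.unpack("!H", recv_exact(tcp, 2))
        return recv_exact(tcp, length)
```

With TCP to port 53 blocked, `lookup("story.news.yahoo.com", "194.8.160.90")` would be expected to fail with a timeout after about 5 seconds whenever the UDP reply comes back truncated, instead of hanging.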

It is pretty common to have a long DNS response which goes via TCP.
glibc definitely uses both UDP and TCP DNS in client mode.
mozilla could just be linked against the DNS resolver provided
by the underlying OS. Such a DNS resolver is usually stable and has
the right timeouts.
For example (on Linux), the libraries /usr/lib/libdns.so.5.3.0 from bind-utils
or /lib/libresolv-2.2.5.so from glibc
seem to do the right thing.
Similar libraries exist in every OS;
I think there is no need to duplicate DNS code in mozilla.

man resolver (on Linux)
RESOLVER(3)         Linux Programmer's Manual         RESOLVER(3)

NAME
       res_init,    res_query,    res_search,    res_querydomain,
       res_mkquery, res_send, dn_comp, dn_expand - resolver  rou-
       tines

SYNOPSIS
       #include <netinet/in.h>
       #include <arpa/nameser.h>
       #include <resolv.h>
       extern struct state _res;



rpm -ql glibc |grep resol
/lib/libresolv-2.2.5.so
/lib/libresolv.so.2

Also see bind-utils package.
rpm -ql bind-utils-9.2.1-0.7x
/usr/bin/dig
/usr/bin/host
/usr/bin/nslookup
/usr/lib/libdns.so.5
/usr/lib/libdns.so.5.3.0
/usr/lib/libisc.so.4
/usr/lib/libisc.so.4.1.0
/usr/share/man/man1/dig.1.gz
/usr/share/man/man1/host.1.gz
/usr/share/man/man5/resolver.5.gz
/usr/share/man/man8/nslookup.8.gz
/usr/share/man/man8/nsupdate.8.gz
Also, from 
man resolver (on RedHat linux 7.3)

RES_USEVC
         Use TCP connections for queries rather than UDP datagrams.
RES_IGNTC
         Ignore truncation errors.  Don't retry with TCP.  [Not currently
         implemented].

Mozilla's current DNS resolver does not seem to work right,
while the underlying OS DNS resolver is OK.
Why not just use it?
By the way,
if you need to reproduce this DNS problem, it is rather easy on Linux.

1. Deny outgoing TCP traffic to your DNS server(s):
iptables -I OUTPUT -p TCP -d your.dns.ip -j DROP
2. Hit any web site for which the DNS response is >512 bytes,
like http://story.news.yahoo.com/ (but this site is strange:
sometimes the DNS server gives a <512 byte response for it).
A better option is to craft a special DNS site with a bunch of IPs/nameservers.
3. See the stalls.

My opinion: the best option is to use the underlying OS
DNS resolver, rather than some home-grown one.

The problem is an extremely annoying one.
I have seen people need to reboot their computers
(few figure out how to do killall mozilla-bin on Linux
or to find and kill the process in the MS-Windows task manager).

+clean-report, cc gordon.

I *know* why it does a mode switch (in fact another case is if you have MTU that
is very small...) but I didn't realize that it happened so easily in real life.
If I were a working/breathing hostmaster, I would never configure my domain to
return such large responses.

We've had this problem on and off for a long time for a couple sites, and nobody
ever put their finger on it.

I don't understand the DNS/TCP timeout. Shouldn't we just call the resolver, and
it manages the TCP timeout values? I can see that this jams up the DNS service,
because we have a serialized service for DNS.

So, if I understand this correctly, this only happens sometimes, and if it does,
the TCP connection actually fails (so you'd see a SYN_SENT entry in netstat -tcp )?
Keywords: clean-report
>We've had this problem on and off for a long time for a couple sites, and nobody
>ever put their finger on it.

Now it is very common. In particular, some sites in .aol.com networks
have a very long list of nameservers/IPs.
Also, some sites behave very strangely: the
returned DNS response is sometimes too long for UDP and sometimes not.

>I don't understand the DNS/TCP timeout. Shouldn't we just call the resolver, and
>it manages the TCP timeout values? I can see that this jams up the DNS service,
>because we have a serialized service for DNS.

Back in netscape 2.x and early versions of 3.x on UNIX, the DNS lookup
was called directly via the UNIX
resolver from the thread which does graphical repainting.
It was OK (except that during a lookup no button could be clicked,
because it was done from the graphical repainting thread),
and some computers had weird DNS timeouts (like 5 minutes).

I personally would use the system resolver function
(or whatever else comes with other OSes like Windows),
with a reasonable DNS (UDP/TCP) timeout value
(preferably explicitly set in prefs.js, but only a default value is also OK),
and do the lookup from a separate thread, as mozilla does right now,
so the user can click on buttons during the DNS lookup.
I did not check the recent Linux API for whether it is possible to set up
a TCP timeout for "connect" during a DNS TCP lookup via the "resolver" function.
If not, a common approach with something like an alarm() signal can be used
to terminate a thread that has been doing a lookup for too long.

Another important resolver feature to add would be this:
if, during a DNS lookup, the user clicks the "STOP" button,
the DNS lookup should be immediately aborted
(for example by sending the same alarm signal to the thread doing the DNS lookup).

There also should be no "runaway" DNS lookup processes,
as happens right now after exiting mozilla when
the failure described in this bug #193827 occurs.

This is my personal opinion; I may have missed something.
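
To make the suggestion above concrete, here is a rough Python sketch of the idea: call the OS resolver from a separate thread, wait for it only up to a deadline, and make sure an abandoned lookup cannot outlive the program. It is an illustration under assumptions, not mozilla code: `resolve_with_timeout` and `slow_resolver` are hypothetical names, and instead of signals it simply stops waiting for the worker thread.

```python
import queue
import socket
import threading
import time


def resolve_with_timeout(host, timeout=10.0, resolver=socket.getaddrinfo):
    """Run a blocking lookup in a daemon thread; give up after `timeout` s.

    Returns the resolver's result, or None if it did not finish in time.
    """
    results = queue.Queue(maxsize=1)

    def worker():
        try:
            results.put(("ok", resolver(host, None)))
        except OSError as exc:  # socket.gaierror is a subclass of OSError
            results.put(("error", exc))

    # daemon=True: an abandoned lookup cannot keep the process alive at exit.
    threading.Thread(target=worker, daemon=True).start()
    try:
        status, value = results.get(timeout=timeout)
    except queue.Empty:
        return None
    if status == "error":
        raise value
    return value


def slow_resolver(host, port):
    """Hypothetical stand-in for a lookup stalled on a blocked TCP connect."""
    time.sleep(1.0)
    return []


print(resolve_with_timeout("story.news.yahoo.com", 0.1, slow_resolver))  # None
```

The caller gets control back after the deadline even though the underlying lookup is still blocked, and because the worker is a daemon thread nothing is left running after exit. A "STOP" button could use the same mechanism by simply no longer waiting on the queue.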

>So, if I understand this correctly, this only happens sometimes, 
>and if it does, the TCP connection actually fails 
>(so you'd see a SYN_SENT entry in netstat -tcp )?

Yes, my DNS provider (wrongly) does not allow DNS lookups via TCP;
all TCP requests to port 53 are just denied without any ICMP reply.
-----------------
$ telnet 194.8.160.90 53
Trying 194.8.160.90...
=== and 20 minutes later ===
Connection timeout
-----------------

The problem described in bug #193827 occurs
(almost) always when I get this from dig:
-----------------
$ dig story.news.yahoo.com a
;; Truncated, retrying in TCP mode.
-----------------
But dig only sometimes manages
to get a response in UDP mode.
This is why the problem is so intermittent.

I think the described problem occurs when the UDP response was truncated,
mozilla tried to retry over TCP, and TCP failed.

The reason why I (and many other people I know)
noticed this is that, while such a multiple failure is rare by itself
(in another location, where the DNS server
works properly with TCP requests, this happens to me about once a month),
once it has happened mozilla is in an unusable state.
You need to reboot the computer or manually kill the process.
This is extremely annoying, and it is why people notice it.

In your bugzilla database alone I found several such reports
about mozilla in an unusable state.

http://bugzilla.mozilla.org/show_bug.cgi?id=192271
http://bugzilla.mozilla.org/show_bug.cgi?id=188332

(and probably many others, because some people attribute this
to stalls, not to DNS).
So is this report confirmed? Why is it still in the Unconfirmed state?
There is enough data to have an engineer research this TCP timeout suggested by
the dig output.

I've looked for an API that shows a TCP timeout for resolver and been unable to
figure out why dig works and mozilla doesn't. I also have discussed this with
Darin, and he doesn't know of any obvious explanations for the information we
have here.

There is also more analysis I could do, if you made it a 1.4a blocker, which
would take me off cookies for a half day.
Flags: blocking1.4a? → blocking1.4a-
This problem appears to be much more common than I originally thought.

What seems to be happening:
1. mozilla sends the DNS query via UDP, but the datagram gets dropped
   on its way to the DNS server because of high network traffic.
2. mozilla then tries to send the DNS query via TCP and hits
   the problem described in this bug.

The described behaviour occurs pretty commonly.

And what is most annoying: after this scenario has happened,
mozilla is in an unusable state. Even exiting mozilla does not help:
several mozilla processes are still running in the background (see below).
The user needs to either reboot the computer or run
killall mozilla-bin
Otherwise a new mozilla instance cannot be started.

See another bug
http://bugzilla.mozilla.org/show_bug.cgi?id=192271
with similar runaway DNS query processes in background.

Or search google: there are a number of postings
about mozilla forcing users to reboot their computers.

P.S. Processes in background after mozilla exit:
ps axuww|grep mozilla
mal       1239  2.3 19.4 62616 49532 ?       S    03:00   3:35
/usr/lib/mozilla/mozilla-bin
mal       1245  0.0 19.4 62616 49532 ?       S    03:00   0:00
/usr/lib/mozilla/mozilla-bin
mal       1247  0.0 19.4 62616 49532 ?       S    03:00   0:00
/usr/lib/mozilla/mozilla-bin
mal       1249  0.0 19.4 62616 49532 ?       S    03:00   0:01
/usr/lib/mozilla/mozilla-bin
mal      17105  0.0 19.4 62616 49532 ?       S    05:11   0:00
/usr/lib/mozilla/mozilla-bin
re comment #23:

Vladislav, how are you exiting mozilla?
I am exiting mozilla from main menu via

File->Quit

All mozilla windows get closed,
but mozilla processes are still running in the background.
They can be seen with
ps axuww|grep mozilla
Dougt, can you look into this for 1.4? 
Status: UNCONFIRMED → NEW
Ever confirmed: true
first over to gordon.
Assignee: dougt → gordon
Flags: blocking1.4b?
Flags: blocking1.4b-
Flags: blocking1.4?
Flags: blocking1.4? → blocking1.4-
We've had this problem for a long time. If I could show that there are many
other cases of this problem, and they are causing people to restart their
browser, would that justify blocking 1.4?

Or if I could reproduce this internally, would it justify getting some
engineering analysis?

At this time, I don't think that clearing the DNS cache is a workaround. When
this happens, your DNS is broken until you restart.

>and they are causing people to restart their browser

You cannot just restart the browser; you need to kill it manually
using the UNIX kill command.
A browser restart just does not help.


Regarding reproducing this bug,
the easiest way seems to be the following:

1. Create a DNS record which is long enough
   that the request is made in UDP and then in TCP mode.
2. Set the firewall on the client machine
   (using ipchains or iptables on Linux) to drop DNS requests in TCP mode.

Then I believe the problem can be reproduced.
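
For step 1, a back-of-the-envelope estimate of how many addresses such a record needs can be done like this. It is a sketch under simplifying assumptions: the name `big.example.test` is hypothetical, every answer is an A record whose owner name is compressed to a 2-byte pointer, and authority/additional sections and EDNS0 are ignored, so a real server's reply may cross the 512-byte limit sooner.

```python
def encoded_name_length(name):
    """Bytes for an uncompressed DNS name: one length byte per label + root."""
    labels = name.rstrip(".").split(".")
    return sum(1 + len(label) for label in labels) + 1


def a_response_size(name, address_count):
    """Estimated size of a reply carrying `address_count` A records."""
    header = 12
    question = encoded_name_length(name) + 4  # QTYPE + QCLASS
    # Each A record: 2-byte name pointer, TYPE (2), CLASS (2), TTL (4),
    # RDLENGTH (2) and a 4-byte IPv4 address = 16 bytes.
    answer = 16 * address_count
    return header + question + answer


def min_addresses_to_truncate(name, limit=512):
    """Smallest number of A records whose reply no longer fits in `limit`."""
    count = 0
    while a_response_size(name, count) <= limit:
        count += 1
    return count


print(a_response_size("big.example.test", 29))        # 498
print(min_addresses_to_truncate("big.example.test"))  # 30
```

Under these assumptions, about 30 addresses on one name should be enough to force the truncated-UDP-then-TCP path.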
>Browser restart just does not help.

I meant that File->Quit does not terminate mozilla once the problem
in question has occurred.
Some processes are left in the background after File->Quit.
Vladislav: the problem of mozilla not exiting on a stalled DNS lookup is bug 192271.
this is probably a duplicate of bug 192271... or at least, once that bug is
fixed, this one should pretty much fall by the wayside.  marking as dependency
for now.
Depends on: 192271
This was probably fixed by the DNS service rewrite. The DNS problem is not
mozilla's fault, since it just calls getaddrinfo() or gethostbyname(). What was
a problem is that mozilla hangs on exit if a DNS lookup is pending. Marking
WORKSFORME since last comment was rather a long time ago.

Reporter, please reopen if you still see this on a recent build.
Status: NEW → RESOLVED
Closed: 22 years ago → 20 years ago
Resolution: --- → WORKSFORME
V. 
That sounds right to me.
Status: RESOLVED → VERIFIED