Closed Bug 260832 Opened 20 years ago Closed 20 years ago

DNS name resolution fails under WinXP-SP2 when max persistent connections bumped up to 100

Categories

(Core :: Networking, defect)

x86
Windows XP
defect
Not set
minor

Tracking

()

RESOLVED INVALID
Future

People

(Reporter: lisanke, Assigned: darin.moz)

Details

(Keywords: helpwanted)

Attachments

(2 files, 1 obsolete file)

User-Agent:       Mozilla/5.0 (Windows; U; Windows NT 5.1; en-US; rv:1.7.3) Gecko/20040910 MultiZilla/1.6.4.0b
Build Identifier: Mozilla/5.0 (Windows; U; Windows NT 5.1; en-US; rv:1.7.3) Gecko/20040910 MultiZilla/1.6.4.0b

First I noted many 'DNS problem' bugs in bugzilla; it is not clear which are
open, at my level, with related symptoms. Hopefully, an expert (developer) will
dupe this to a known problem (with a workaround or fix). Mozilla on XP has DNS
resolution trouble with particular specific 'intranet' URLs (most all other URL
resolve). Mozilla on Win2K (same DNS) has problems resolving all (most) any URL.
NSLOOKUP of site-domain works. Use of the returned dotted IP address in-place of
the site-domain allows Mozilla to browse normally (except page referenced
relative links fail to resolve).

Reproducible: Sometimes
Steps to Reproduce:
1.(XP) Select w3.ibm.com (specific intranet URL).
2. Wait for 'site' not found alert. 

1. (W2K) Select www.yahoo.com (or any other).
2. Wait for 'site' not found alert.

Actual Results:  
Alert: www.yahoo.com could not be found. Check the name and try again.

Expected Results:  
resolve browser to yahoo.
Hi, can you please generate a Mozilla log file for us?  Here's what you'd need
to do:

Open a DOS prompt and set the following environment variables:

  C:\> set NSPR_LOG_MODULES=nsHostResolver:5
  C:\> set NSPR_LOG_FILE=c:\log.txt

Now, in the same DOS prompt, 'cd' into the directory containing Mozilla, and
then run it from the DOS prompt.  Something like this:

  C:\> cd \path\containing\mozilla
  C:\path\containing\mozilla\> .\mozilla.exe

Now, reproduce the problem, and close Mozilla.  Then upload c:\log.txt to this
bug report using the "Create a New Attachment" link.

Thanks!!
Darin,

On my W2K machine I found I had 1.6; I tried the logging you suggest; no log
after running with your setup. I installed 1.7.3 and repeated; no log after
setup. I will reproduce failure of 

  C:\> set NSPR_LOG_MODULES=nsHostResolver:5
  C:\> set NSPR_LOG_FILE=c:\log.txt

to produce a logfile, when running mozilla from same DOS session, on the XP
machine next. I don't know what, but I hope it means something, that no logging
was produced.

I'll comment again after restarting browser (after test on (primary) XP machine).

Regards,

Mike
Darin,

OK, back from running experiment on XP Mozilla; again no log.txt was produced.
Something I should have said in the last comment; on W2K a 0 bytes c:\log.txt
was produced; on XP nothing.

If this is not expected; maybe their is a trival error in the SET syntax
(version issue)? Just a guess.

Regards,

Mike


Darin,

I read your note on Logging in Mozilla's online debug documents, which indicates
specifically "if you have a nightly trunk build of Mozilla, then you can set
some environment variables before running Mozilla to capture a log of HTTP
activity." This implies a 'nightly trunk build' of Mozilla. 

Is a nightly build required to get debug logging? Should I install one of these
(on W2K backup), and which? 

Regards,

Mike
All recent release builds should do logging but it`s possible that
"nsHostResolver:5" is only availible in a debug build..
I'm having the same problem as described here. The recommended log file also 
produced nothing for me. If I set it to "all:5" I get a big log, of course, 
but I'm not sure that's useful. Should I post it?

The easiest way I can repro this is to go to Google News and Ctrl-Click a pile 
of links (say 10-12) to generate new tabs. 

This is with the latest nightly (20040917), so provisos re: nightly builds 
shouldn't apply in this case.

In case it's relevant, I have several extensions installed, some of which 
might be relevant, including: LiveHTTPHeaders, TabbrowserExtensions, and 
AdBlock. 

I used to think that this or a bug similar to this was related to pipelining, 
but I turned that back off a long time ago (and have left it disabled), and 
while it seemed to help at the time, it doesn't seem to matter for this issue.
Ray,

Thanks for the comment, and idea to use 'all:5' to catch some log of the
problem. I did a quick test (bring up mozilla to my home page; fails on W2K
machine) and terminate immediately after fail. I produced a log (trimmed of some
redundant lines which I attach to end of message). Interesting thing, with
logging as all:5 the failure seemed to occur (present) faster, than without.

Regards,

Mike

---------------------------big logfile attached here----------------------------
0[233e10]: Loaded library Executable (init)
0[233e10]: Loaded library ws2_32.dll (load lib)
0[233e10]: Unloaded library ws2_32.dll
0[233e10]: Loaded library C:\Program Files\Common
Files\mozilla.org\GRE\1.7.3_2004091008\xpcom.dll (load lib)
0[233e10]: C:\Program Files\Common
Files\mozilla.org\GRE\1.7.3_2004091008\xpcom.dll incr => 2 (find lib)
0[233e10]: C:\Program Files\Common
Files\mozilla.org\GRE\1.7.3_2004091008\xpcom.dll decr => 1
0[233e10]: read -> 161084
0[233e10]: read -> 91163
0[233e10]: Loaded library C:\Program Files\Common
Files\mozilla.org\GRE\1.7.3_2004091008\components\embedcomponents.dll (load lib)
0[233e10]: Loaded library C:\Program
Files\mozilla.org\Mozilla\components\qfaservices.dll (load lib)
0[233e10]: Loaded library C:\Program
Files\mozilla.org\Mozilla\components\typeaheadfind.dll (load lib)
0[233e10]: Loaded library C:\Program Files\Common
Files\mozilla.org\GRE\1.7.3_2004091008\components\xppref32.dll (load lib)
0[233e10]: Loaded library C:\Program Files\Common
Files\mozilla.org\GRE\1.7.3_2004091008\components\necko.dll (load lib)
0[233e10]: read -> 88
***************************************60 read-xxxx lines removed Mike Lisanke
************************
0[233e10]: read -> 3416
0[233e10]: read -> 0
0[233e10]: Loaded library C:\Program Files\Common
Files\mozilla.org\GRE\1.7.3_2004091008\components\gklayout.dll (load lib)
0[233e10]: Loaded library C:\Program Files\Common
Files\mozilla.org\GRE\1.7.3_2004091008\components\caps.dll (load lib)
0[233e10]: Loaded library C:\Program Files\Common
Files\mozilla.org\GRE\1.7.3_2004091008\components\xpc3250.dll (load lib)
0[233e10]: read -> 7424
0[233e10]: read -> 2543
0[233e10]: connect -> 0
0[233e10]: Loaded library C:\Program Files\Common
Files\mozilla.org\GRE\1.7.3_2004091008\components\i18n.dll (load lib)
1204[edc160]: nsSocketTransportService::Run
1204[edc160]:   calling PR_Poll [active=0 idle=0]
0[233e10]: Loaded library C:\Program Files\Common
Files\mozilla.org\GRE\1.7.3_2004091008\components\imglib2.dll (load lib)
0[233e10]: Loaded library C:\Program Files\Common
Files\mozilla.org\GRE\1.7.3_2004091008\components\rdf.dll (load lib)
0[233e10]: Loaded library C:\Program
Files\mozilla.org\Mozilla\components\msgMapi.dll (load lib)
0[233e10]: Loaded library C:\Program
Files\mozilla.org\Mozilla\components\jsd3250.dll (load lib)
0[233e10]: Loaded library C:\Program
Files\mozilla.org\Mozilla\components\palmsync.dll (load lib)
0[233e10]: Loaded library C:\Program Files\Common
Files\mozilla.org\GRE\1.7.3_2004091008\components\appshell.dll (load lib)
0[233e10]: Loaded library C:\Program Files\Common
Files\mozilla.org\GRE\1.7.3_2004091008\components\gkwidget.dll (load lib)
0[233e10]: Loaded library C:\Program Files\Common
Files\mozilla.org\GRE\1.7.3_2004091008\components\profile.dll (load lib)
0[233e10]: Loaded library C:\Program Files\Common
Files\mozilla.org\GRE\1.7.3_2004091008\components\xpcom_compat_c.dll (load lib)
0[233e10]: read -> 2836
0[233e10]: read -> 0
0[233e10]: Loaded library C:\Program Files\Common
Files\mozilla.org\GRE\1.7.3_2004091008\components\chrome.dll (load lib)
0[233e10]: Loaded library C:\Program Files\Common
Files\mozilla.org\GRE\1.7.3_2004091008\components\gkparser.dll (load lib)
0[233e10]: Loaded library C:\Program Files\Common
Files\mozilla.org\GRE\1.7.3_2004091008\components\uconv.dll (load lib)
0[233e10]: Loaded library C:\Program Files\Common
Files\mozilla.org\GRE\1.7.3_2004091008\components\docshell.dll (load lib)
0[233e10]: read -> 4096
0[233e10]: read -> 4096
***************************************10 read-xxxx lines removed Mike Lisanke
************************
0[233e10]: read -> 12145
0[233e10]: read -> 8634
0[233e10]: Loaded library C:\Program Files\Common
Files\mozilla.org\GRE\1.7.3_2004091008\components\gkgfxwin.dll (load lib)
0[233e10]: Loaded library C:\Program Files\Common
Files\mozilla.org\GRE\1.7.3_2004091008\components\pipboot.dll (load lib)
0[233e10]: Loaded library C:\Program Files\Common
Files\mozilla.org\GRE\1.7.3_2004091008\components\oji.dll (load lib)
0[233e10]: Loaded library C:\Program Files\Common
Files\mozilla.org\GRE\1.7.3_2004091008\components\gkplugin.dll (load lib)
0[233e10]: NPN Logging Active!
0[233e10]: General Plugin Logging Active! (nsPluginHostImpl::ctor)
0[233e10]: NPP Logging Active!
0[233e10]: nsPluginHostImpl::ctor
0[233e10]: read -> 225
0[233e10]: LoadCachedPluginsInfo : Loading Cached plugininfo for C:\Program
Files\mozilla.org\Mozilla\plugins\npnul32.dll
0[233e10]: nsPluginHostImpl::ScanPluginsDirectory dir=C:\Program
Files\mozilla.org\Mozilla\plugins
0[233e10]: nsPluginTag::RegisterWithCategoryManager plugin=C:\Program
Files\mozilla.org\Mozilla\plugins\npnul32.dll, removing = no
0[233e10]: nsPluginHostImpl::GetPluginFactory End mime=application/x-java-vm,
rv=-2147467259, plugin=0 name=(not found)
0[233e10]: read -> 1375
0[233e10]: read -> 55805
0[233e10]: read -> 7689
***************************************10 read-xxxx lines removed Mike Lisanke
************************
0[233e10]: read -> 1282
0[233e10]: read -> 226
0[233e10]: Loaded library C:\Program Files\Common
Files\mozilla.org\GRE\1.7.3_2004091008\components\cookie.dll (load lib)
0[233e10]: read -> 409
0[233e10]: read -> 0
0[233e10]: Loaded library C:\Program Files\Common
Files\mozilla.org\GRE\1.7.3_2004091008\components\webbrwsr.dll (load lib)
0[233e10]: Loaded library C:\Program
Files\mozilla.org\Mozilla\components\appcomps.dll (load lib)
0[233e10]: Creating nsHttpHandler [this=e70938].
0[233e10]: nsHttpHandler::Init
0[233e10]: nsHttpHandler::PrefsChanged [pref=(null)]
0[233e10]: read -> 4096
0[233e10]: read -> 1752
0[233e10]: read -> 0
0[233e10]: Loaded library C:\Program Files\Common
Files\mozilla.org\GRE\1.7.3_2004091008\components\jar50.dll (load lib)
0[233e10]: read -> 1024
0[233e10]: read -> 4096
***************************************15 read-xxxx lines removed Mike Lisanke
************************
0[233e10]: read -> 1673
0[233e10]: read -> 0
0[233e10]: read -> 30
0[233e10]: read -> 178
0[233e10]: nsHttpAuthCache::Init
0[233e10]: Creating nsHttpConnectionMgr @140f848
0[233e10]: nsHttpConnectionMgr::Init
0[233e10]: nsHttpHandler::StartPruneDeadConnectionsTimer
0[233e10]: read -> 32
0[233e10]: read -> 0
0[233e10]: read -> 5312
0[233e10]: read -> 8192
0[233e10]: read -> 16
0[233e10]: read -> 8172
0[233e10]: read -> 8192
0[233e10]: read -> 8192
***************************************220 read-xxxx lines removed Mike Lisanke
************************
0[233e10]: read -> 0
0[233e10]: read -> 30
0[233e10]: read -> 3946
0[233e10]: read -> 498
1156[140a380]: read -> 30
1156[140a380]: read -> 44
1156[140a380]: read -> 30
1156[140a380]: read -> 57
1156[140a380]: read -> 30
1156[140a380]: read -> 57
1156[140a380]: read -> 30
1156[140a380]: read -> 56
0[233e10]: read -> 4096
0[233e10]: read -> 4096
0[233e10]: read -> 3066
0[233e10]: read -> 0
1156[140a380]: read -> 30
1156[140a380]: read -> 56
0[233e10]: read -> 30
0[233e10]: read -> 8192
***************************************70 read-xxxx lines removed Mike Lisanke
************************
0[233e10]: read -> 8192
0[233e10]: read -> 8192
0[233e10]: read -> 8192
0[233e10]: read -> 8192
0[233e10]: read -> 1883
0[233e10]: read -> 0
0[233e10]: read -> 4533
1156[140a380]: read -> 1024
1156[140a380]: read -> 2566
1156[140a380]: read -> 0
1156[140a380]: read -> 30
1156[140a380]: read -> 2118
0[233e10]: read -> 8192
0[233e10]: read -> 8192
***************************************60 read-xxxx lines removed Mike Lisanke
************************
0[233e10]: read -> 2132
0[233e10]: read -> 30
0[233e10]: read -> 4498
0[233e10]: read -> 8192
0[233e10]: read -> 8192
0[233e10]: read -> 2759
0[233e10]: read -> 0
0[233e10]: Loaded library C:\Program
Files\mozilla.org\Mozilla\components\txmgr.dll (load lib)
0[233e10]: read -> 30
0[233e10]: read -> 3202
0[233e10]: read -> 30
0[233e10]: read -> 4096
***************************************20 read-xxxx lines removed Mike Lisanke
************************
0[233e10]: read -> 4096
0[233e10]: read -> 4096
0[233e10]: read -> 1010
1156[140a380]: read -> 30
1156[140a380]: read -> 68
1156[140a380]: read -> 30
1156[140a380]: read -> 53
0[233e10]: read -> 30
0[233e10]: read -> 4096
***************************************10 read-xxxx lines removed Mike Lisanke
************************
0[233e10]: read -> 4096
0[233e10]: read -> 4096
0[233e10]: read -> 4096
0[233e10]: read -> 62
1156[140a380]: read -> 30
1156[140a380]: read -> 2300
0[233e10]: read -> 30
0[233e10]: read -> 4096
0[233e10]: read -> 4096
0[233e10]: read -> 4096
0[233e10]: read -> 4096
0[233e10]: read -> 1328
***************************************10 read-xxxx lines removed Mike Lisanke
************************
0[233e10]: read -> 4096
0[233e10]: read -> 4096
0[233e10]: read -> 4096
0[233e10]: read -> 4096
0[233e10]: read -> 4096
0[233e10]: read -> 2024
1156[140a380]: read -> 30
1156[140a380]: read -> 65
1156[140a380]: read -> 30
1156[140a380]: read -> 2228
0[233e10]: read -> 30
0[233e10]: read -> 787
***************************************20 read-xxxx lines removed Mike Lisanke
************************
0[233e10]: read -> 2730
0[233e10]: read -> 30
0[233e10]: read -> 8192
0[233e10]: read -> 536
0[233e10]: read -> 4096
0[233e10]: read -> 4096
0[233e10]: read -> 1156
0[233e10]: read -> 4096
0[233e10]: read -> 1790
1156[140a380]: read -> 30
1156[140a380]: read -> 178
0[233e10]: Loaded library C:\Program Files\Common
Files\mozilla.org\GRE\1.7.3_2004091008\components\editor.dll (load lib)
1156[140a380]: read -> 30
1156[140a380]: read -> 217
1156[140a380]: read -> 30
1156[140a380]: read -> 1480
0[233e10]: read -> 30
0[233e10]: read -> 488
1156[140a380]: read -> 30
0[233e10]: read -> 30
0[233e10]: read -> 3937
1156[140a380]: read -> 165
1156[140a380]: read -> 30
1156[140a380]: read -> 200
1156[140a380]: read -> 30
1156[140a380]: read -> 51
0[233e10]: read -> 30
0[233e10]: read -> 978
0[233e10]: read -> 30
0[233e10]: read -> 4096
***************************************20 read-xxxx lines removed Mike Lisanke
************************
0[233e10]: read -> 4096
0[233e10]: read -> 4096
0[233e10]: read -> 4096
0[233e10]: read -> 1257
0[233e10]: read -> 30
0[233e10]: read -> 3826
1156[140a380]: read -> 30
1156[140a380]: read -> 1030
1332[19b97b0]: read -> 30
1332[19b97b0]: read -> 112
1316[19b9180]: read -> 30
1316[19b9180]: read -> 67
0[233e10]: read -> 30
0[233e10]: read -> 4096
0[233e10]: read -> 4096
0[233e10]: read -> 4096
0[233e10]: read -> 2415
1332[19b97b0]: read -> 30
1332[19b97b0]: read -> 941
1316[19b9180]: read -> 30
1316[19b9180]: read -> 929
1156[140a380]: read -> 30
1156[140a380]: read -> 932
1332[19b97b0]: read -> 30
1332[19b97b0]: read -> 132
1316[19b9180]: read -> 30
1316[19b9180]: read -> 214
0[233e10]: read -> 30
0[233e10]: read -> 1827
0[233e10]: read -> 30
0[233e10]: read -> 3194
1156[140a380]: read -> 30
1332[19b97b0]: read -> 30
1332[19b97b0]: read -> 163
1156[140a380]: read -> 908
0[233e10]: read -> 2257
0[233e10]: SecureUI:1acf128: Init
0[233e10]: SecureUI:1acf128: OnStateChange: progress: for toplevel
0[233e10]: SecureUI:1acf128: OnStateChange
0[233e10]: SecureUI:1acf128: 1ab89f4 18ad4e0 OnStateChange 10001
about:layout-dummy-request
0[233e10]: SecureUI:1acf128: OnStateChange: not relevant for sub content
0[233e10]: SecureUI:1acf128: OnStateChange: progress: for toplevel
0[233e10]: SecureUI:1acf128: OnStateChange
0[233e10]: SecureUI:1acf128: 1ab89f4 1aaaa20 OnStateChange 30004 about:blank
0[233e10]: SecureUI:1acf128: OnStateChange: not relevant for sub content
0[233e10]: SecureUI:1acf128: OnStateChange: progress: for toplevel
0[233e10]: SecureUI:1acf128: OnStateChange
0[233e10]: SecureUI:1acf128: 1ab89f4 18ad4e0 OnStateChange 10004
about:layout-dummy-request
0[233e10]: SecureUI:1acf128: OnStateChange: not relevant for sub content
0[233e10]: SecureUI:1acf128: OnStateChange: progress: for toplevel
0[233e10]: SecureUI:1acf128: OnStateChange
0[233e10]: SecureUI:1acf128: 1ab89f4 18ad4e0 OnStateChange 10010
about:layout-dummy-request
0[233e10]: SecureUI:1acf128: OnStateChange: not relevant for sub content
0[233e10]: SecureUI:1acf128: OnStateChange: progress: for toplevel
0[233e10]: SecureUI:1acf128: OnStateChange
0[233e10]: SecureUI:1acf128: 1ab89f4 1aaaa20 OnStateChange 10010 about:blank
0[233e10]: SecureUI:1acf128: OnStateChange: SOMETHING STOPS FOR TOPMOST DOCUMENT
0[233e10]: SecureUI:1acf128: OnStateChange: not relevant for sub content
0[233e10]: SecureUI:1acf128: OnStateChange: progress: for toplevel
0[233e10]: SecureUI:1acf128: OnStateChange
0[233e10]: SecureUI:1acf128: 1ab89f4 1aaaa20 OnStateChange 20010 about:blank
0[233e10]: SecureUI:1acf128: OnStateChange: not relevant for sub content
0[233e10]: SecureUI:1acf128: OnStateChange: progress: for toplevel
0[233e10]: SecureUI:1acf128: OnStateChange
0[233e10]: SecureUI:1acf128: 1ab89f4 1aaaa20 OnStateChange c0010 about:blank
0[233e10]: SecureUI:1acf128: OnStateChange: not relevant for sub content
0[233e10]: read -> 3095
1316[19b9180]: read -> 30
1316[19b9180]: read -> 160
0[233e10]: read -> 30
0[233e10]: read -> 2725
0[233e10]: read -> 5287
0[233e10]: nsHttpHandler::NewURI
0[233e10]: nsHttpHandler::NewProxiedChannel [proxyInfo=0]
0[233e10]: Creating nsHttpChannel @1a176a8
0[233e10]: nsHttpChannel::Init [this=1a176a8]
0[233e10]: host=www.yahoo.com port=-1
0[233e10]: uri=http://www.yahoo.com/
0[233e10]: Creating nsHttpConnectionInfo @1afbb98
0[233e10]: nsHttpHandler::AddStandardRequestHeaders
0[233e10]: nsHttpHandler::BuildUserAgent
0[233e10]: nsHttpChannel::AsyncOpen [this=1a176a8]
0[233e10]: read -> 518
0[233e10]: read -> 0
0[233e10]: nsHttpHandler::NotifyObservers [chan=1a176a8
event="http-on-modify-request"]
0[233e10]: SecureUI:1acf128: OnStateChange: progress: for toplevel
0[233e10]: SecureUI:1acf128: OnStateChange
0[233e10]: SecureUI:1acf128: 1ab89f4 1a176a8 OnStateChange 70001
http://www.yahoo.com/
0[233e10]: SecureUI:1acf128: OnStateChange: SOMETHING STARTS FOR TOPMOST DOCUMENT
0[233e10]: SecureUI:1acf128: OnStateChange: start for toplevel document
0[233e10]: SecureUI:1acf128: OnStateChange: ++mDocumentRequestsInProgress
0[233e10]: nsHttpChannel::Connect [this=1a176a8]
0[233e10]: nsHttpChannel::OpenCacheEntry [this=1a176a8]
1332[19b97b0]: read -> 30
1332[19b97b0]: read -> 4096
1332[19b97b0]: read -> 4096
1332[19b97b0]: read -> 4096
1332[19b97b0]: read -> 4096
1332[19b97b0]: read -> 4096
1332[19b97b0]: read -> 3423
0[233e10]: read -> 135168
0[233e10]: read -> 4096
0[233e10]: read -> 4096
0[233e10]: read -> 4096
0[233e10]: write -> 4096
0[233e10]: got cache entry [access=2]
0[233e10]: nsHTTPChannel::CheckCache [this=1a176a8 entry=1b24840]
0[233e10]: nsHttpChannel::AddAuthorizationHeaders? [this=1a176a8]
0[233e10]: nsHttpAuthCache::GetAuthEntryForPath [key=http://www.yahoo.com:80 path=/]
0[233e10]: nsHttpChannel::SetupTransaction [this=1a176a8]
0[233e10]: Creating nsHttpTransaction @1b21538
0[233e10]: nsHttpTransaction::Init [this=1b21538 caps=1]
0[233e10]: http request [
0[233e10]:   GET / HTTP/1.1
0[233e10]:   Host: www.yahoo.com
0[233e10]:   User-Agent: Mozilla/5.0 (Windows; U; Windows NT 5.0; en-US;
rv:1.7.3) Gecko/20040910
0[233e10]:   Accept:
text/xml,application/xml,application/xhtml+xml,text/html;q=0.9,text/plain;q=0.8,image/png,image/jpeg,image/gif;q=0.2,*/*;q=0.1
0[233e10]:   Accept-Language: en-us,en;q=0.5
0[233e10]:   Accept-Encoding: gzip,deflate
0[233e10]:   Accept-Charset: ISO-8859-1,utf-8;q=0.7,*;q=0.7
0[233e10]:   Keep-Alive: 300
0[233e10]:   Connection: keep-alive
0[233e10]: ]
0[233e10]: nsHttpConnectionMgr::AddTransaction [trans=1b21538]
0[233e10]: nsSocketTransportService::PostEvent [event=1ad8ca0]
0[233e10]: send: fd=ee2918 osfd=612 buf=60f77564 amount=1
1204[edc160]: recv: fd=ee2888 osfd=640 buf=13dfb34 amount=1024 flags=0
1204[edc160]: recv -> 1, error = 0, os error = 0
0[233e10]: send -> 1
1204[edc160]: nsHttpConnectionMgr::OnMsgNewTransaction [trans=1b21538]
1204[edc160]: nsHttpConnectionMgr::GetConnection [ci=..www.yahoo.com:80 caps=1]
1204[edc160]: nsHttpConnectionMgr::AtActiveConnectionLimit
[ci=..www.yahoo.com:80 caps=1]
1204[edc160]:    total=0, persist=0
1204[edc160]: Creating nsHttpConnection @1b24d88
1204[edc160]: nsHttpConnection::Init [this=1b24d88]
1204[edc160]: nsHttpConnectionMgr::DispatchTransaction [ci=..www.yahoo.com:80
trans=1b21538 caps=1 conn=1b24d88]
1204[edc160]: nsHttpConnection::Activate [this=1b24d88 trans=1b21538 caps=1]
1204[edc160]: creating nsSocketTransport @1b24678
1204[edc160]: nsSocketTransport::Init [this=1b24678 host=www.yahoo.com:80 proxy=:0]
1204[edc160]: nsSocketTransport::OpenOutputStream [this=1b24678 flags=2]
1204[edc160]: nsSocketTransport::PostEvent [this=1b24678 type=0 status=0 param=0]
1204[edc160]: nsSocketTransportService::PostEvent [event=1b24e68]
1204[edc160]: send: fd=ee2918 osfd=612 buf=60f77564 amount=1
1204[edc160]: send -> 1
1204[edc160]: nsSocketTransport::OpenInputStream [this=1b24678 flags=2]
1204[edc160]: nsSocketTransport::PostEvent [this=1b24678 type=0 status=0 param=0]
1204[edc160]: nsSocketTransportService::PostEvent [event=1b24ea0]
1204[edc160]: send: fd=ee2918 osfd=612 buf=60f77564 amount=1
1204[edc160]: send -> 1
1204[edc160]: nsSocketOutputStream::AsyncWait [this=1b24728]
1204[edc160]:   calling PR_Poll [active=0 idle=0]
1204[edc160]: recv: fd=ee2888 osfd=640 buf=13dfb34 amount=1024 flags=0
1204[edc160]: recv -> 2, error = 0, os error = 0
1204[edc160]: nsSocketTransport::OnSocketEvent [this=1b24678 type=0 status=0
param=0]
1204[edc160]:   MSG_ENSURE_CONNECT
1204[edc160]: nsSocketTransport::ResolveHost [this=1b24678]
1204[edc160]:   advancing to STATE_RESOLVING
1204[edc160]: nsSocketTransport::SendStatus [this=1b24678 status=804b0003]
1204[edc160]: nsHttpTransaction::OnSocketStatus [this=1b21538 status=804b0003
progress=0]
1204[edc160]: nsSocketTransport::OnSocketEvent [this=1b24678 type=0 status=0
param=0]
1204[edc160]:   MSG_ENSURE_CONNECT
1204[edc160]:   ignoring redundant event
1204[edc160]:   calling PR_Poll [active=0 idle=0]
292[1b24868]: thread exiting
0[233e10]: read -> 1666
0[233e10]: read -> 645
0[233e10]: read -> 30
0[233e10]: read -> 296
0[233e10]: read -> 1512
0[233e10]: read -> 30
0[233e10]: read -> 643
0[233e10]: read -> 6515
0[233e10]: read -> 408
0[233e10]: read -> 30
0[233e10]: read -> 514
0[233e10]: sending status notification [this=1a176a8 status=804b0003 progress=0/0]
0[233e10]: read -> 30
0[233e10]: read -> 2695
0[233e10]: read -> 2139
0[233e10]: read -> 3539
0[233e10]: read -> 261112
1000[1815ff8]: nsSocketTransport::PostEvent [this=1b24678 type=1 status=804b001e
param=0]
1000[1815ff8]: nsSocketTransportService::PostEvent [event=2004830]
1000[1815ff8]: send: fd=ee2918 osfd=612 buf=60f77564 amount=1
1204[edc160]: recv: fd=ee2888 osfd=640 buf=13dfb34 amount=1024 flags=0
1204[edc160]: recv -> 1, error = -5987, os error = 0
1000[1815ff8]: send -> 1
1204[edc160]: nsSocketTransport::OnSocketEvent [this=1b24678 type=1
status=804b001e param=0]
1204[edc160]:   MSG_DNS_LOOKUP_COMPLETE
1204[edc160]:   after event [this=1b24678 cond=804b001e]
1204[edc160]: nsSocketTransport::OnSocketDetached [this=1b24678 cond=804b001e]
1204[edc160]: nsSocketTransport::RecoverFromError [this=1b24678 state=2
cond=804b001e]
1204[edc160]: nsSocketInputStream::OnSocketReady [this=1b2470c cond=804b001e]
1204[edc160]: nsSocketOutputStream::OnSocketReady [this=1b24728 cond=804b001e]
1204[edc160]: nsHttpConnection::OnSocketWritable [this=1b24d88]
1204[edc160]:   writing transaction request stream
1204[edc160]: nsSocketOutputStream::Write [this=1b24728 count=414]
1204[edc160]:   ReadSegments returned [rv=0 read=0 sock-cond=804b001e]
1204[edc160]: nsHttpConnection::CloseTransaction[this=1b24d88 trans=1b21538
reason=804b001e]
1204[edc160]: nsHttpTransaction::Close [this=1b21538 reason=804b001e]
1204[edc160]: nsHttpConnectionMgr::ReclaimConnection [conn=1b24d88]
1204[edc160]: nsSocketTransportService::PostEvent [event=1ba0430]
1204[edc160]: send: fd=ee2918 osfd=612 buf=60f77564 amount=1
1204[edc160]: send -> 1
1204[edc160]: nsHttpConnection::Close [this=1b24d88 reason=804b001e]
1204[edc160]: nsSocketInputStream::CloseWithStatus [this=1b2470c reason=804b001e]
1204[edc160]: nsSocketOutputStream::CloseWithStatus [this=1b24728 reason=804b001e]
1204[edc160]:   calling PR_Poll [active=0 idle=0]
1204[edc160]: recv: fd=ee2888 osfd=640 buf=13dfb34 amount=1024 flags=0
1204[edc160]: recv -> 1, error = -5987, os error = 0
1204[edc160]: nsHttpConnectionMgr::OnMsgReclaimConnection [conn=1b24d88]
1204[edc160]:   connection cannot be reused; closing connection
1204[edc160]: nsHttpConnection::Close [this=1b24d88 reason=80004004]
1204[edc160]: nsSocketInputStream::CloseWithStatus [this=1b2470c reason=80004004]
1204[edc160]: nsSocketOutputStream::CloseWithStatus [this=1b24728 reason=80004004]
1204[edc160]: nsHttpConnectionMgr::OnMsgProcessPendingQ [ci=..www.yahoo.com:80]
1204[edc160]: nsHttpConnectionMgr::ProcessPendingQForEntry [ci=..www.yahoo.com:80]
1204[edc160]: nsHttpConnectionMgr::ProcessPendingQForEntry [ci=..www.yahoo.com:80]
1204[edc160]: Destroying nsHttpConnection @1b24d88
1204[edc160]: nsSocketOutputStream::CloseWithStatus [this=1b24728 reason=80470002]
1204[edc160]: nsSocketInputStream::CloseWithStatus [this=1b2470c reason=80470002]
1204[edc160]:   calling PR_Poll [active=0 idle=0]
1000[1815ff8]: destroying nsSocketTransport @1b24678
0[233e10]: nsHttpChannel::OnStartRequest [this=1a176a8 request=1a3b410
status=804b001e]
0[233e10]:   calling mListener->OnStartRequest
0[233e10]: nsHttpChannel::OnStopRequest [this=1a176a8 request=1a3b410
status=804b001e]
0[233e10]: nsHttpTransaction::DeleteSelfOnConsumerThread [this=1b21538]
0[233e10]: Destroying nsHttpTransaction @1b21538
0[233e10]: nsHttpChannel::FinalizeCacheEntry [this=1a176a8]
0[233e10]:   calling OnStopRequest
0[233e10]: nsHttpChannel::CloseCacheEntry [this=1a176a8 status=804b001e]
0[233e10]: dooming cache entry!!
0[233e10]: SecureUI:1acf128: OnStateChange: progress: for toplevel
0[233e10]: SecureUI:1acf128: OnStateChange
0[233e10]: SecureUI:1acf128: 1ab89f4 1a176a8 OnStateChange 10010
http://www.yahoo.com/
0[233e10]: SecureUI:1acf128: OnStateChange: SOMETHING STOPS FOR TOPMOST DOCUMENT
0[233e10]: SecureUI:1acf128: OnStateChange: --mDocumentRequestsInProgress
0[233e10]: SecureUI:1acf128: OnStateChange: progress: for toplevel
0[233e10]: SecureUI:1acf128: OnStateChange
0[233e10]: SecureUI:1acf128: 1ab89f4 1a176a8 OnStateChange 20010
http://www.yahoo.com/
0[233e10]: nsHttpHandler::NewURI
0[233e10]: read -> 30
0[233e10]: read -> 2695
***************************************10 read-xxxx lines removed Mike Lisanke
************************
0[233e10]: read -> 3265
0[233e10]: read -> 30
0[233e10]: read -> 2688
0[233e10]: read -> 30
0[233e10]: read -> 2668
0[233e10]: read -> 30
0[233e10]: read -> 167
1156[140a380]: read -> 30
1156[140a380]: read -> 226
0[233e10]: SecureUI:1acf128: OnStateChange: progress: for toplevel
0[233e10]: SecureUI:1acf128: OnStateChange
0[233e10]: SecureUI:1acf128: 1ab89f4 1a176a8 OnStateChange c0010
http://www.yahoo.com/
0[233e10]: write -> 4096
0[233e10]: write -> 4096
***************************************60 write-xxxx lines removed Mike Lisanke
************************
0[233e10]: write -> 4096
0[233e10]: write -> 3064
0[233e10]: nsPluginHostImpl::Destroy Called
0[233e10]: nsPluginTag::RegisterWithCategoryManager plugin=C:\Program
Files\mozilla.org\Mozilla\plugins\npnul32.dll, removing = yes
0[233e10]: write -> 2836
0[233e10]: nsHttpHandler::Observe [topic="profile-change-net-teardown"]
0[233e10]: nsHttpHandler::StopPruneDeadConnectionsTimer
0[233e10]: nsHttpAuthCache::ClearAll
0[233e10]: nsHttpConnectionMgr::Shutdown
0[233e10]: nsSocketTransportService::PostEvent [event=1aaa800]
0[233e10]: send: fd=ee2918 osfd=612 buf=60f77564 amount=1
1204[edc160]: recv: fd=ee2888 osfd=640 buf=13dfb34 amount=1024 flags=0
1204[edc160]: recv -> 1, error = -5987, os error = 0
0[233e10]: send -> 1
1204[edc160]: nsHttpConnectionMgr::OnMsgShutdown
1204[edc160]:   calling PR_Poll [active=0 idle=0]
1000[1815ff8]: thread exiting
0[233e10]: nsSocketTransportService::Shutdown
0[233e10]: send: fd=ee2918 osfd=612 buf=60f77564 amount=1
1204[edc160]: recv: fd=ee2888 osfd=640 buf=13dfb34 amount=1024 flags=0
1204[edc160]: recv -> 1, error = -5987, os error = 0
0[233e10]: send -> 1
1204[edc160]: shutting down socket transport thread...
1204[edc160]: thread exiting
0[233e10]: Destroying nsHttpChannel @1a176a8
0[233e10]: Destroying nsHttpConnectionInfo @1afbb98
0[233e10]: write -> 4096
0[233e10]: write -> 4096
***************************************60 write-xxxx lines removed Mike Lisanke
************************
0[233e10]: write -> 4096
0[233e10]: write -> 3064
0[233e10]: write -> 131072
0[233e10]: write -> 4096
0[233e10]: nsHttpHandler::Observe [topic="xpcom-shutdown"]
0[233e10]: nsHttpHandler::StopPruneDeadConnectionsTimer
0[233e10]: nsHttpAuthCache::ClearAll
0[233e10]: nsHttpConnectionMgr::Shutdown
1316[19b9180]: thread exiting
1332[19b97b0]: thread exiting
1156[140a380]: thread exiting
0[233e10]: nsSocketTransportService::Shutdown
0[233e10]: Deleting nsHttpHandler [this=e70938]
0[233e10]: nsHttpConnectionMgr::Shutdown
0[233e10]: Destroying nsHttpConnectionMgr @140f848
0[233e10]: nsPluginHostImpl::dtor
0[233e10]: nsPluginHostImpl::Destroy Called
1220[f163b0]: thread exiting
1064[235f68]: thread exiting
0[233e10]: Unloaded library C:\Program Files\Common
Files\mozilla.org\GRE\1.7.3_2004091008\xpcom.dll
---------------------------end big logfile--------------------------------------
Mike: thanks for the log file.  for future reference, it's better to use the
"Create a New Attachment" feature of bugzilla.  that way long attachments can be
managed more easily :-)
(In reply to comment #5)
> All recent release builds should do logging but it`s possible that
> "nsHostResolver:5" is only availible in a debug build..

That logging module should be enabled by default in release builds unless the
release build was configured with the --disable-logging flag (see
"about:buildconfig" to check if your build has that flag).

FWIW, I checked and nsHostResolver:5 is not enabled on the Aviary branch, which
explains why you got an empty log file when using Firefox 1.0PR.  If you
download a nightly trunk build (of Firefox or Mozilla Seamonkey), then you
should be able to generate a nsHostResolver:5 log.

The rest of the logging output is less useful (I think) because this bug is
about DNS issues, so it'd be good to see what our host resolver thinks it is
doing ;-)
Darin,

Yah, sorry about the logfile embed. I thought better of it, but; I hate to run
scripts of any kind if I can help it. 

OK, now for the (good)bad news. I downloaded the Nightly build and before I
could test that the logging now works. Mozilla's install launch to Mozilla.org
worked. A quick test, and yes, I now had no problem to www.yahoo.com, et. al.

It seems stange that a 1.6 and 1.7.3 install has a problem, and the nightly
build doesn't. Does this mean anything to you, or anyone else reading along?

Regards,

Mike
This goes away in the Firefox trunk build for me too. Hmmm. I forgot to mention
that was happening on Firefox PR 1.0 for me.

While using the trunk is great for me, it sucks for the 1.0 release. I was
getting ready to consider switching back to something like Opera (or even,
shudder, IE) before I tried the trunk.

Personally, I think this should block Firefox 1.0, but I don't remember whether
I'm supposed to set the flag to "?" to request that (or which flag, for that
matter :-)...
Severity: normal → major
I take it back. I'm still having a problem even on the trunk nightly (it was 
less pronounced, I thought, but I suspect some kind of caching was fooling me).

Still repro's easily by going to Google News and Ctrl-Clicking on a bunch of 
different sites to get a bunch of tabs.

However, I still can't get "set NSPR_LOG_MODULES=nsResolver:5" to put anything 
in the log file (it creates an empty one, which stays empty), even with the 
latest trunk nightly build (and about:buildconfig does not show the --disable-
logging flag). 
(In reply to comment #12)
> However, I still can't get "set NSPR_LOG_MODULES=nsResolver:5" to put anything 

try using nsHostResolver:5
Attached file nsHostResolver log (obsolete) —
Doh!!!
Attachment #160419 - Attachment mime type: application/octet-stream → text/plain
This is getting intolerable lately. The most recent 1.0PR build (20041025) seems
(subjectively) to be even worse about this, but it could just be my frustration
level building. 

In spite of having invested a lot of time in Mozilla, I'm getting close to
switching to Opera or something because of this problem. Even IE is starting to
look appealing...

Requesting 1.0 block (hope I did that right :-). 
Flags: blocking-aviary1.0?
Hi Ray,

Unfortunately the log file doesn't show us anything useful :-(  Thanks for
capturign though.  We'll have to think of another way to uncover what is
troubling your system.  I've heard of a few other reports of DNS problems, but
I've never experienced any myself.  It's very difficult to diagnose a problem if
you can't see it! ;-)

Can you please try changing the logging to:

  NSPR_LOG_MODULES=nsHostResolver:5,nsSocketTransport:5

That way I might be able to deduce further what is going on.  Thanks!
The problem may be related to DNS response timing. For me, the problem
disappeared (without change to Mozilla) on my XP system. Also, changing to a
nightly build on the Win2K system also 'fixed' the problem. It is not clear what
was wrong, except that certain DNS look-ups failed for Mozilla only. 
> The problem may be related to DNS response timing. For me, the problem
> disappeared (without change to Mozilla) on my XP system. Also, changing to a
> nightly build on the Win2K system also 'fixed' the problem. It is not clear what
> was wrong, except that certain DNS look-ups failed for Mozilla only. 

Under WinXP we use the Microsoft getaddrinfo API, whereas on Win2k and other
older versions of Windows we use the traditional gethostbyname.  getaddrinfo is
used so that we can support IPv6 if enabled on the host system.  I think we
revert to gethostbyname if we detect that IPv6 is not enabled (i.e., if creating
an IPv6 socket fails).

This leaves us having very little option since we are simply invoking a
Microsoft API for DNS.  Mozilla does not even know that DNS will be used to
resolve hostnames, so there is usually very little we can do in these situations :-(

The exception may be if we find that we are not calling the API correctly or
something like that, which is unlikely since the API is so simple.
Here's a log as requested. At the end of the log, I'm still stuck trying to
load some images on the web pages I opened, but it did finish loading all the
main pages, so this log shows examples of it eventually working as well as
indicating the stuck state. 

Hopefully this will turn out to be useful. 

BTW, I'm running XP, but I've run XP in the past without having this problem.

Also, I've tried disabling IPv6 DNS resolution (by setting
network.dns.disableIPv6 to true), and that had no effect, in case that's an
anti-clue. 

The documentation for the getaddrinfo API doesn't look especially simple to me,
but perhaps that's just because it's been a long time since I did any network
programming.
Attachment #160419 - Attachment is obsolete: true
Thanks for the log file Ray... you may also want to try adjusting these prefs:

  network.dnsCacheExpiration
  network.dnsCacheEntries

Both are integer values, where dnsCacheExpiration is the number of seconds to
hold a lookup result in mozilla's DNS cache, and dnsCacheEntries is the number
of entries to store before evicting old entries.  These are "hidden" prefs in
that their default values are not listed in about:config, but you can easily
just use the "right click" and select "New->Integer" method to add these.

Please let me know if adjusting those prefs helps any.
NOTE: older versions of Mozilla used to cache DNS lookups indefinitely, but
people complained that that caused problems.  So, we reduced our default
expiration time to 60 seconds, and keep only 20 entries.

If modifying those prefs helps you, then we may want to consider bumping up the
value of the defaults to something higher.

As I said before since Mozilla uses the operating system's resolver API, there
is no way for Mozilla to infer DNS TTL values that could be used to build an
optimal DNS cache.  As such, we are left with the crude method of applying a
global expiration value.
Well, upping the dns cache limits seems to help mask the problem to the point
where it might be bearable if I cranked up the expiration time to several days
and the number of entries to 1000 or so. 

The experiment did give me some insights, though: it looks to me like at some
point down the line, Firefox gets stuck on one of the DNS lookups, and then
*all* the others get wedged behind it until some really long timeout. Once that
one stuck one lets go, the rest seem to resolve pretty quickly.

An experiment with IE shows that it doesn't have any kind of queue of these
requests... it seems to do them all in parallel. For example, on the same set of
test sites from news.google.com, Firefox got stuck on most of them, and only
eventually freed itself up and loaded them all. With IE, it got stuck on one of
them (salon.com for some reason), but all the rest loaded immediately. 

This raises another important clue: it's not *just* that my machine is
configured in some weird way or is connected through some quirky ISP that
prevents it from doing DNS lookups. Other apps do this fine. But there *is* some
quirk that causes DNS to get somewhat stuck for me on some occasions.

Even in the IE case, though, it seemed to recover much faster than Mozilla. It's
almost like it had a shorter retry interval or something. 

The nice thing about cranking up the DNS cache on Firefox, though, is that once
I've gotten to a site, it seems (subjectively) like I can at least open other
windows on that site pretty reliably even if FF is stuck on some other DNS
resolve. This will keep me using/testing it for considerably longer than if I
didn't have any workaround... I guess I'd say that it still sucks, but at least
I can live with it for now.

I've done enough testing on this now to feel pretty comfortable that it's a real
FF bug, so I'm confirming it. Please accept my apologies if that is the wrong
thing to do for some arcane reason :-).
Status: UNCONFIRMED → NEW
Ever confirmed: true
putting on the blocking list until we know more.  we would need a low risk patch
and lots of testing before landing on the branch for firefox 1.0
Flags: blocking-aviary1.0? → blocking-aviary1.0+
I see this problem too (mozilla, not ff): sporadically mozilla will cease to
resolve any hostnames, even though ping and nslookup and other programs can
contact sites just fine.  Restarting mozilla a couple times always cures the
problem (often the first restart doesn't do it, but so far the second restart
has always cured it).  Clearing the cache and/or setting cache size to 0 doesn't
cure it.

Mozilla 1.7.3 (and earlier versions -- I've been seeing this for a while) on
Debian unstable with kernel 2.6.8.1 (and various earlier versions).  CONFIG_IPV6
is not set, if that matters (I've seen mentions of that in other DNS-related
bugs while searching for this one).
Hmm, before akkana's comment I thought this might be related to the fact that we
are using gethostbyname on Windows.  I thought that maybe the OS was serializing
all threads accessing that function.  But, the fact that akkana's says she sees
something like this on Linux, makes me suspect that it might be a problem in the
DNS thread pool.  Hmm...
Status: NEW → ASSIGNED
Target Milestone: --- → mozilla1.8beta
hm... seems to me like this is the same as most if not all duplicates of bug
240759. maybe one of those provides more info.
Windows NT XP and 2000 have a system service that cahches responses to DNS
queries.  Nslookup does not use it.  When it gets hosed you see symmptoms
similar to those reported in this bug.  Unless this bug can be reporduced wiht
the DNS Client service disabled inthe operating system.  I would suspect this is
not a valid bug.
Just repproduced it with the XP DNS client service stopped. 

Also, it's been reported on Linux (comment 24).

Additionally, I would be extremely surprised to find that IE doesn't use the DNS
caching service (in fact, why else would they even bother implement a DNS
caching service?). And it doesn't have this problem at all.
Well, there's one way to find out if this is a problem with the Windows DNS: the
next time that it happens, run IPCONFIG /FLUSHDNS and see if the problem goes away.

If it does, it's a problem with Window's DNS cache.  If it doesn't, it's a
problem with Mozilla.
Re: comment 29. There's no effect from ipconfig/flushdns. I see 2 possible ways
it could have had an impact: freeing up an already "stuck" FF, and allowing a
non-stuck FF to work properly after the flush. I tested both, and
ipconfig/flushdns has no effect in either case. 

Of course, now that I've bumped up my dns cache limits, it's getting
increasingly difficult to test this :-). But luckily Google News supplies me
with a nearly unlimited number of weird sites I've never been to... 
It is odd that such a major problem would be reproducible by only a relatively
few users.  I'm not convinced that the Linux problem (reported by akkana) is at
all a duplicate of this WinXP one.

Ray: does your computer have multiple CPUs?  hyper-threaded?  etc.?  What about
you, akkana?
Yes, my machine is hyperthreaded. It's a pretty new one (3.2 GHz).
Interestingly, I switched to that machine at a time not too drastically
different from when I started noticing this problem. I would take that with a
grain of salt, because I can't swear to an actual correlation, but it might be
an interesting avenue to investigate. Deadlocks in the DNS thread pool that only
show up on a fast (effectively) multi-processor system are a distinct possibility.

I suppose I should install FF on my even newer Media Center PC at home and see
if it happens there... not sure when I'll be able to get around to that,
though... I don't use it for browsing much (TVs make sucky monitors).
I regularly use a SMP linux box and have yet to encounter a problem like this,
but it would not surprise me if there was a race-condition of some sort here.
If I were the type to offer advise :-), I'd suggest doing some thought
experiments along the lines of "what would happen if a particular DNS lookup got
hung or failed at the same time that X happened" where X includes: 1) another
DNS request came in (especially if it "came in" due to needing to look up an
address for an image that was just parsed on the main page of some tab), 2)
another thread's DNS lookup succeeded/failed, 3) you ran out of threads in some
pool, 4) some mutex (perhaps indirectly) protecting some part of the DNS system
is held, and 5) others that I can't think of right now...

Also, generally examining the re-entrancy of the DNS subsystem wouldn't hurt.
I'd offer to do it, but it would take me a very long time to come up to speed on
the source architecture and I'm really busy right now.

The idea of the code "spontaneously" needing to do another lookup is especially
interesting to me, because one of the times I most often see this bug is when a
main page has finished loading, but its external images aren't loaded yet.
That's a sufficiently common occurance that I wouldn't get too excited about it
(could just be statistics), but it might be a clue.
Here's the code for the thread pool and asynchronous host resolver:
http://lxr.mozilla.org/mozilla/source/netwerk/dns/src/nsHostResolver.cpp#377

It's dependencies on other mozilla code are limited to NSPR and pldhash.
re comment #28.  I guess I should have read through all the comments on this bug
a little more closely.  I thought the issue was that the same hostname that
resoves in nslookup does not resove in Mozilla.  I have had issues like this and
the dns caching was the cause, but in those cases the name did not resolve in IE
either.  However I now remember having had issue like this in mozilla in the
past and I concluded that the issue might be this one:

http://support.microsoft.com/default.aspx?scid=kb;en-us;815768

So, I disabled IPV6 on my system and have not had a similar issue since.

But none of this expains the commenter who said they are having similar issues
under linux.
Sorry for bugsapm on this re comment #36.   I realized after posting that this
is not the same knowlegbase article I was talking about.  I cant seem to find
the one I had before.  Perhaps it wasn't on the microsoft site.  The one I had
found before said a workaround was to disable IPV6 support and told you how to
accomplish that.
Ok, time to expose my complete ignorance of the codebase:

At line 549, it looks like the code is waiting for some condition to become true
(or a timeout occurs) while holding the lock (from line 535). Is this correct?
If so, I'm *extremely* suspicious of this, because ResolveHost grabs the lock
before it calls IssueLookup to hand its request off to the idle thread (which
has the lock held). This would be deadlocked until line 554 is hit. 

The only question would be why this doesn't happen more often...

Some other ideas/questions:

At line 511: it looks like IssueLookup can return NS_OK without actually issuing
the lookup if mThreadCount >= MAX_THREADS. Perhaps that's not harmful, but I'm
having a hard time following the consequences. 

At line 388: does result's constructor initialize it to NULL or something
equivilent? Otherwise it's used without being initialized at line 451. (grasping
at straws... this appears to be done in so many places that I have to believe it
does the right thing). 


Just one CPU here, a P3.  I suspect I'm seeing the same issue reported by many
people (on various platforms) in bug 146769, which may or may not be the same
issue as the Windows people here are seeing.  It's unfortunately very sporadic
and I haven't detected any regularity to what's happening at the time; certainly
I do use tabs heavily (lots of the "lost DNS" bugs seem to mention opening links
in new tabs, though I'm not clear why that would matter) and when it happened
yesterday, there was some network flakiness that afternoon (mtr was showing some
lossage), so perhaps DNS was slower than usual and that causes a race condition
somehow.

Anecdotally, yesterday I mentioned the problem on a linux-heavy IRC channel (to
see if anyone else had trouble with http from a host I was trying -- no one did)
and got several "Oh, my mozilla/firefox does that too" comments (where host
resolution stops working until a restart).
so, just to be clear the symptom is that the browser appears completely unable
to resolve hosts until a restart?

ray: is that what you see too?  or in your case are there just unusually long
delays?
> At line 549, it looks like the code is waiting for some condition to become true
> (or a timeout occurs) while holding the lock (from line 535). Is this correct?

yes


> If so, I'm *extremely* suspicious of this, because ResolveHost grabs the lock
> before it calls IssueLookup to hand its request off to the idle thread (which
> has the lock held). This would be deadlocked until line 554 is hit. 

when you wait on a condition variable, you implicitly release the mutex.  hence,
ResolveHost is allowed to enter the mutex, and eventually notify the condition
variable inside IssueLookup.


> At line 511: it looks like IssueLookup can return NS_OK without actually 
...

we add the new lookup to mPendingQ before testing mThreadCount.  if we find that
there are already active threads (and if the max thread count is reached), then
we assume that one of the busy threads will process our lookup once it finishes
with the one it is processing.


> At line 388: does result's constructor initialize it to NULL or something
> equivilent? Otherwise it's used without being initialized at line 451. 

yeah, nsRefPtr initializes its "value" to null by default.
OK please ignore all my previous comments on this bug (you probably are anyway :-) )

What I was mis-remembering is a previously reported bug 219088.

The issue was a bug in the DNS Client service in Windows.  The issue had to do
with resolving hostnames that resolved to a CNAME record that pointed to a host
with both A and AAAA records (and I think you had to have IPV6 installed on the
PC too) to see the issue.  The problem was that only the AAAA records were being
returned and not he A records in the CNAME case.

As far as I know this issue was subsequently fixed by Microsoft, but I am not
100% sure of that. 

So some of the people seeing this issue could actually be seeing the bug 219088
issue which turned out to be a Windows OS issue.

The workaround for that issue was to not run the DNC Client service so that it
was not used at all for queries, either by stopping the service via the control
panel or with the command "net stop dnscache".

IPCONFIG /FLUSHDNS as used in comment #29 is not sufficient.

It sounds as if there is a new issue here having to do with the way mozilla does
the name resolution that appears to effect multi-threaded processors only. 
However, I still think it would be a good idea for people working this issue to
disable the DNS CLient service just to be sure it is not the cause of the problem.
If you'll tell me how to turn on debug tracing for the condition variable
locking scheme(s), I'll try to look for a deadlock that way... I've gone back to
using the 1.0PR branch, but I might have time to revert to the trunk if needed
to get the appropriate output.

Anyway, I'm always suspicious about code that looks like it's holding a lock
throughout that actually releases/regrabs it implicitly somewhere in the middle... 

For example, it looks to me like the variable mHaveIdleThread isn't effectively
protected by mLock. If I'm reading it right, 2 threads can both set
mHaveIdleThread while only 1 of them is actually waiting for the condition
variable via the underlying OS mutex (the second thread can grab mLock for a
very brief time after the first thread unlocks it inside PR_WaitCondVar). As a
result, the one that is actually waiting can be freed up by line 509, and set
mHaveIdleThread=FALSE before the second thread even waits. Now we have the
situation where there's an idle thread, but the rest of the system thinks there
isn't one. 

That might be harmless (hard to tell), because it will just create a new thread
if it thinks there isn't an idle one... *unless* we get to the state where
mThreadCount == MAX_THREADS. Then IssueLookup will neither create a new thread
nor signal the waiting one. But then eventually the one that *was* freed will
come back and become free again, I guess... bleah... what a mess. <----Brain
melts down here.

Also, I can't quite figure out whether there are actually 2 mutexes/critical
sections here: mLock, plus some kind of internal lock protecting access to the
cvar struct itself (there are calls to something called _PR_CVAR_LOCK(cvar)
inside _PR_WaitCondVar in addition to what looks like a thread lock... as well
as locking and unlocking mLock). If there really are 2+ locks here, then there
are also potential inversions in the order locks are grabbed that would have to
be identified and analyzed, because of course these could be leading to deadlocks.

All of this makes it really hard to debug this problem by analysis...

(side note: I don't know what happens here if 2 threads are waiting on a
condition variable when it is released... this would depend on the exact kind of
mutex that is used. Some kinds will free all threads, some will free only 1,
some have other behaviors... I got lost digging that out of the source, so I'm a
bit unclear on the exact behavior of the above scenario... Therefore, I worded
it so it doesn't matter, because only 1 thread will actually be waiting... if
the mutex is a "single release" kind, then this scenario can happen much more
easily).

Seeing this bug on WinXP sp1 and sp2 FF1.0.1PR

Seems random and non-replicatable, but still occurs infrequently.. anecdotally I
seem to recall that it happened after an initial problem loading a request,
thought it was a problem with firewall at first, but have confirmed that's not
the case, then thought it was the fact that windows update was running at the
same time, this also not the case..
no action. gone.
Flags: blocking-aviary1.0+ → blocking-aviary1.0-
Hmmm. Thinking about this some more, what about this scenario:

1) The max number of threads are allocated, and all the threads are idle and
waiting on the condition variable.

2) A new request comes in, releasing one thread, which sets mHaveIdleThread to
false and calls PR_GetAddrInfoByName.

3) For some reason, this hangs (perhaps DNS is unhappy for some reason for that
particular host).

4) No further requests can be processed. This happens because mHaveIdleThread
was cleared when GetHostToLookup was released from its wait on the condition
variable. As a result, we have mThreadCount == MAX_THREADS and mHaveIdleThread
== FALSE, so IssueLookup neither creates a new thread nor signals any of the
waiting ones. Hung until either the pending DNS lookup finally finishes, or one
of the idle threads times out. 

This would only happen if you managed to saturate the thread pool with requests,
so that might explain why this isn't seen very much, because as far as I can
tell, this will happen every time. Of course, perhaps DNS lookup failures also
are rare.
Ray,

I think you may have figured it out!  I think mHaveIdleThread should be a
counter instead of a boolean.  Well done!  I'll cut a patch, and then we can try
it out on the trunk and see if it solves the problem.
Sigh... I take that back.  In fact there can only ever be one idle thread.  We
let the rest of the threads terminate.  So, a boolean is perfectly fine.  I
forgot that small but significant detail when I first read your comment.

Do you have any reason to believe that all 8 resolver threads are blocked?  Do
you have a copy of MSVC so that you can break into the debugger and look at the
call stacks for each thread?
The threads *do* terminate, as you say, but only after a 1 minute (?) timeout
has elapsed, and then only if the queue actually stays empty for that whole
time. If requests trickle in, they will get picked up and handled (*very*
slowly, because we're relying on individual unrelated timeouts expiring), but
then the threads will (potentially) wait a minute again.

The only part I don't understand is that it (subjectively :-) seems like much
longer than 1 minute. I'll go measure that (after I hit "Commit" on this comment
:-). 

I think a counter is a better idea than a boolean in any case, because it's
definitely a wedging that easily can, but shouldn't, be happening. 

Also, if the DNS hang is reasonably common and semi-permanent for some reason
(perhaps one of the Windows bugs mentioned earlier), then the various threads
will slowly become wedged over time and the pool will get smaller and smaller,
which will force this wedging to happen more and more frequently.

I suppose it might be nice to have a watchdog on the DNS threads to avoid such
problems, but perhaps that's asking too much...

Of course, this is still all hypothetical. This might not be where we're getting
wedged at all... it could, for example, be some kind of deadlock-with-timeout
inside Windows' version of getaddrinfo that only a few people's machines are
experiencing. 

I do have MSVC... any suggestions for (hopefully a not-too-time-consuming) way
to proceed with that? I suppose if I break in and "a bunch" of threads are all
waiting somewhere, that would be a clue... but I probably don't have time to
figure out how to set up a whole build environment in order to get myself some
symbols to debug against...

we only wait on the condition variable if !mHaveIdleThread, which means that
there should only ever be one thread waiting on the condition variable.

yeah, building mozilla on win32 is somewhat involved since it requires cygwin. 
even without symbols, the stack traces may be valuable since MSVC will tell us
the DLLs for each function on the stack.  if we see > 8 threads coming from
necko.dll stuck inside ws2_32.dll then that will be a good clue that those
threads are stuck inside win32 resolver code.
Dang... that seemed so promising. 

I got it running in MSVC, but I'm getting the very weird result that all the
threads are stuck at the same address in ntdll.dll (and I don't happen to have
the symbols for that to see if that's a "reasonable" place). 

The weird part is that I mean *all* the threads, main and otherwise. That's just
too weird. So I'm quite suspicious of that result. If it were true, that would
point to either a Windows bug or some kind of API misuse. I'll go try to debug
some more. 

I'm downloading symbols (and also WinDbg, because it can often resolve stuff
that MSVC 6.0 can't). Normally, I'd already have those on my machine... but I
switched machines a few months ago and haven't had the need until now...
Ok, I think I may have tracked this down. Here's a comment I found at a network
optimization site (http://snakefoot.fateback.com/tweak/winnt/network.html):

--- Start of quote ---
Note if having activated SYN-Attack-Protection or installed WinXP SP2, a limit
is introduced on how many connection attempts (half-open) one can make
simultanously. This will limit worms like blaster and sasser from spreading too
fast, but it will also limit other applications that creates many new
connections simultanously (Like P2P). 
EventID 4226: TCP/IP has reached the security limit imposed on the number of
concurrent TCP connect attempts
--- End of quote ---

Looking in my event log, I find that indeed I'm seeing this event being logged
at around the time when this "freeze" is happening. So, it is likely the MS is
gating on the number of simultaneously open connection attempts and thereafter
letting Firefox only make connection attempts at some fixed (slow) rate. 

The question then becomes, what is IE doing differently that doesn't invoke this
protection. I'm guessing that perhaps it persists its connections or avoids too
many connections in SYN_WAIT state or something.

I'll continue to investigate. 

Other reporters should check their event log (Control Panel->Administrative
Tools->Event Log->System) for this error to double check my result.

If this is indeed the problem, the question becomes what to do about it. It's
possible to just say "it's a Windows bug", but that's quite unsatisfying. If I
have more of a complaint with the intent of what MS is trying to do with this
security "fix" I would tend to be less irritated, but what they are trying to do
seems reasonable, actually. 

Attached file Stack traces
I'm feeling pretty postive that my last comment is on track, but just in case,
here's a couple of stack dumps from WinDbg showing where Firefox's threads are.


Several of them are waiting on condition variables (though there's not enough
stack information to really be sure of that), and there's one interesting
commonality between the two dumps in this file, which is that one of the
threads is actually in PR_MD_UNLOCK, which is (seemingly) calling select().
I've seen some other indications that perhaps Windows has some performance
issues with select in some circumstances, but in any case it seems odd that
both dumps would show the same situation, so I thought it would be worth
mentioning.
Ok, I've found out why I'm one of the few people that sees this, and have a
workaround.

I had bumped up my max connections parameters (I occasionally want to download a
large number of files from very slow servers in parallel). I was allowing a max
of 100 persistent connections, and something like 30 per server. 

When I reset these all to the defaults, this problem went away entirely.

I'm ok (not hugely happy, but ok) with that as a workaround, but a true fix
would be to separately limit the number of half-open connections to the XPSP2
limit to avoid triggering this security "feature".

The only fix to in XPSP2 appears to be patching the TCP/IP stack, which I'm not
willing to do (in SP2 there's no registry key controlling how many half-open
connections it will allow, unlike the older SYN attack fix). 

Clarification: Total open connections aren't limited in SP2, just incompleted
"half open" ones. Once this limit is hit, your process is hosed for good.
BTW, the XP SP2 limit I mention in comment 54 is 10 simultaneous half-open
connections. So with Firefox's default max connections of 24, there is still the
possibility that this could happen if you really banged on it hard (very
unlikely, but possible). 

I'm thinking that if you looked at a page with images loaded from 30 different
sites, that that would prompt Mozilla to possibly trigger this even with the
default max-connections, so I'm going to try that when I have time. 

As for why this looks like a DNS issue (superficially?), perhaps someone with
more knowledge of Firefox's TCP intricacies than I could say, but I'm guessing
that that just happens to be the most common message to get "stuck" in the
status bar...
great work tracking this down Ray!!!

the default max number of connections applies to all sites.  we limit the number
of persistent connections to just 2 per site.  so, you'd really have to exercise
the browser in some extraordinary way to trigger this bug given our default prefs.

downgrading severity to minor since this does not happen or at least is
virtually impossible to replicate given the default mozilla prefs.  we don't
really support modifying the default http connection limits.  the defaults are
set according to the recommendations of RFC 2616 (as well as some basic
performance tuning i did several years back).

but, yeah... it would be good to address this issue better.  perhaps by putting
a hard cap on the number of persistent connections, or something like that.

futuring this bug.  patches welcome :)
Severity: major → minor
Keywords: helpwanted
Summary: DNS name resolution fails in mozilla; suceeds with nslookup; dotted IP addr works for URL → DNS name resolution fails under WinXP-SP2 when max persistent connections bumped up to 100
Target Milestone: mozilla1.8beta → Future
I agree with the severity assessment. I tried to really beat up on it this
morning. I tried creating a document with tons of external images to different
sites, shift-refreshing frequently... trying to get it to make all 24
connections simultaneously. I also tried rapidly Ctrl-clicking on a ton of links
on news.google.com.

Only once did I manage to get it to trigger XP's connection limit security
feature (verified in the event log), and I couldn't repro that afterwards even
if I stopped XP's DNS cache service, set the DNS cache limit to 1/1 in Firefox,
and cleared the cache between attempts.

If this gets implemented in the future, the key (from what I can figure out
looking at the security feature's documentation) will be limiting the number of
connections that are in the SYN_WAIT state to 10 (I believe this state indicates
that the TCP connection request has been sent out but no response has yet been
received). 

The one thing that worries me about my failure to repro this as described above
is that the intervening routers cache various bits of connection state, so it
may very well be that I couldn't repro it because the connections were all being
made very quickly the second and subsequent times (I have only so many unique
sites I can find in large numbers in one place :-). 

Anyway, there's no need to hard limit the total number of active connections to
10, and indeed I would find that very inconvenient. 
Well, I'm happy that you found and resolved a problem with Mozilla, but; since I
found the original reported problem on W2K and XP SP1, and since it wasn't
confirmed to be resolved by increasing the number of connections in WXP-SP2
allows. I don't know why you didn't choose to create you own bug report and add
your fixes to it. I believe you resolved a much more narrow problem than was
first suggested, and also, I know I read of reports of similar symptoms on
Linux. Now, what are those users to use as a bug report??
He's right, this looks like a totally different bug (originally it looked similar). 

I hate to suggest this, because it's "practical" solution that could be
construed as disrespectful to the original bug poster, but given that almost all
the discussion here is about the bug I found, perhaps the most effective thing
to do would be to re-post the original bug so it can be investigated separately?

My apologies for (unwittingly) hijacking your bug. 
Ack, yikes... sorry... I too didn't mean to morph the original bug report so badly!
Ok, I have enough experiencial evidence to conclude that the problem I originally
reported with MOzilla was in fact some problem with the DNS cache service on
windows. It turns out, not all tools (e.q. nslookup must resolve with this
service) and so some tools can receive good look-ups while Mozilla could miss
some host URL. It appears the windows DNS cache service freezes up at some
point, and changing host IP do not get resolved. Stopping and restarting (or
just disabling the service) resolved thia problem on all systems, and all
Mozilla versions considered. Sorry for the overlap with the other actual
problem. If somebody knows how to split the problem record (or leave this
embedded) they are welcomed to.

Best regards to all,

Mike
marking this bug invalid.  i think in both cases, the bug is not really with
mozilla but with the underlying operating system.  or, you could say in the
multiple connections case that the bug is really a feature of the OS, and that
the problem here is that the user configured mozilla in an unsupported way. 
that'd still make this bug report invalid.
Status: ASSIGNED → RESOLVED
Closed: 20 years ago
Resolution: --- → INVALID
You need to log in before you can comment on or make changes to this bug.

Attachment

General

Creator:
Created:
Updated:
Size: