If Mozilla is waiting for a DNS response and the user Quits, Mozilla will hang.

Steps to reproduce:
1. Try to access <http://www.imagemagick.org/> using the DNS at <126.96.36.199>
2. Observe Mozilla trying to resolve that URL
3. Quit Mozilla.

Expected results: Mozilla should abort the DNS session and quit.
Actual results: Mozilla hangs.
(See also bug 192272 for a crash that occurred in this state.)
This bug is not Mac specific: http://bugzilla.mozilla.org/show_bug.cgi?id=193827
It happens because DNS resolution in TCP mode does not seem to work correctly.
I've been seeing both this and bug 193827 (inability to resolve new sites) on Linux. I'll attach the stacks from Linux shutdown.
OS: MacOS X → All
Hardware: Macintosh → All
the situation is that we have called gethostbyname, which may block until the OS either gets the DNS result or determines that it cannot get the DNS result. because of network problems or just slow DNS servers, gethostbyname can block for a relatively long time. the solution we've been planning is to spawn multiple threads (up to some limit) for calling gethostbyname. this will help keep the browser usable while an existing gethostbyname is blocked. as for the shutdown problem, we might want to look at making the threads unjoinable... or find some way to cancel the gethostbyname call. there's an uber-bug for this problem somewhere...
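A minimal sketch of the "unjoinable thread" idea, not the actual Mozilla design: the blocking lookup runs in a detached thread, and the caller waits on a condition variable with a deadline instead of joining. Everything here is hypothetical (`lookup_with_timeout`, `struct lookup`, the `LOOKUP_TIMED_OUT` sentinel), and it uses getaddrinfo rather than gethostbyname because gethostbyname returns a pointer to static storage and is not safe to call from several threads at once.

```c
#define _POSIX_C_SOURCE 200809L
#include <assert.h>
#include <errno.h>
#include <netdb.h>
#include <pthread.h>
#include <stdlib.h>
#include <string.h>
#include <sys/socket.h>
#include <sys/types.h>
#include <time.h>

#define LOOKUP_TIMED_OUT (-1000) /* hypothetical sentinel, not a getaddrinfo code */

struct lookup {
    pthread_mutex_t lock;
    pthread_cond_t  done_cv;
    int  done;      /* set by the worker once getaddrinfo has returned */
    int  status;    /* getaddrinfo return value */
    int  refs;      /* one reference for the caller, one for the worker */
    char host[256];
};

/* Drop one reference; whoever drops the last one frees the request. */
static void lookup_unref(struct lookup *l)
{
    pthread_mutex_lock(&l->lock);
    int remaining = --l->refs;
    pthread_mutex_unlock(&l->lock);
    if (remaining == 0) {
        pthread_cond_destroy(&l->done_cv);
        pthread_mutex_destroy(&l->lock);
        free(l);
    }
}

/* Worker: the only place that may block for a long time. */
static void *lookup_main(void *arg)
{
    struct lookup *l = arg;
    struct addrinfo hints, *res = NULL;

    memset(&hints, 0, sizeof hints);
    hints.ai_socktype = SOCK_STREAM;
    int status = getaddrinfo(l->host, NULL, &hints, &res);
    if (status == 0)
        freeaddrinfo(res); /* a real resolver would hand the result back */

    pthread_mutex_lock(&l->lock);
    l->status = status;
    l->done = 1;
    pthread_cond_signal(&l->done_cv);
    pthread_mutex_unlock(&l->lock);
    lookup_unref(l);
    return NULL;
}

/* Returns the getaddrinfo status, or LOOKUP_TIMED_OUT if the worker is
 * still blocked after timeout_sec seconds. The worker is never joined. */
int lookup_with_timeout(const char *host, int timeout_sec)
{
    struct lookup *l = calloc(1, sizeof *l);
    if (l == NULL)
        return EAI_MEMORY;
    pthread_mutex_init(&l->lock, NULL);
    pthread_cond_init(&l->done_cv, NULL);
    l->refs = 2;
    strncpy(l->host, host, sizeof l->host - 1);

    pthread_attr_t attr;
    pthread_t tid;
    pthread_attr_init(&attr);
    pthread_attr_setdetachstate(&attr, PTHREAD_CREATE_DETACHED);
    int err = pthread_create(&tid, &attr, lookup_main, l);
    pthread_attr_destroy(&attr);
    if (err != 0) {
        l->refs = 1; /* no worker exists, so only our reference remains */
        lookup_unref(l);
        return EAI_SYSTEM;
    }

    struct timespec deadline;
    clock_gettime(CLOCK_REALTIME, &deadline);
    deadline.tv_sec += timeout_sec;

    pthread_mutex_lock(&l->lock);
    int rc = 0;
    while (!l->done && rc != ETIMEDOUT)
        rc = pthread_cond_timedwait(&l->done_cv, &l->lock, &deadline);
    int status = l->done ? l->status : LOOKUP_TIMED_OUT;
    pthread_mutex_unlock(&l->lock);
    lookup_unref(l);
    return status;
}
```

With this shape, a call such as `lookup_with_timeout("127.0.0.1", 5)` returns within the deadline even if the resolver never answers, and shutdown has nothing to join. The cost is that a stuck worker may still be running when the process exits, and a limit on the number of concurrent workers (as suggested above) would need to be added separately.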
Hmm. I wonder if this (and bug 193827) was the issue I mailed darin about last week.

On unix, we can use pthread_cancel, but I think we need NPTL to use it. man pthread_cancel says:

  POSIX specifies that a number of system calls (basically, all system calls that may block, such as read(2), write(2), wait(2), etc.) and library functions that may call these system calls (e.g. fprintf(3)) are cancellation points. LinuxThreads is not yet integrated enough with the C library to implement this, and thus none of the C library functions is a cancellation point.

and SUS says:

  If a thread has cancelability enabled and a cancellation request is made with the thread as a target while the thread is suspended at a cancellation point, the thread shall be awakened and the cancellation request shall be acted upon

(gethostbyname is a cancellation point)

Does RH9 still have that text in the manpage? I believe that the new threading stuff fixed that. Can someone test?

How do we quit, anyway? It looks like the DNS thread calls nsThread::Join, which calls PR_JoinThread, but that will block until the thread exits or is cancelled. Don't we have to cancel the thread instead? wtc, is there an NSPRd pthread_cancel we can call (which would then presumably work with NPTL)? LXR doesn't find a call to pthread_cancel, so I'm guessing not.

I don't think that this is solvable via LinuxThreads, since once we're blocked, we're stuck. We could set the cancellation type to PTHREAD_CANCEL_ASYNCHRONOUS, but we'd have to test that for LinuxThreads. An explicit pthread_cancel may still work, I guess, although that man page text doesn't seem encouraging.

Does this happen on Windows, btw? (Or another non-unix-based OS.) How do BSD-style OSes handle cancellation points?
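For anyone who wants to test the cancellation behaviour on their threading library, here is a small sketch, assuming a POSIX system where read(2) is a cancellation point (as it should be under NPTL). It blocks a thread in read() on a pipe that nobody writes to, cancels it, and checks that the join reports PTHREAD_CANCELED. The function names (`cancel_blocked_thread`, `blocked_reader`, `on_cancel`) are made up for the example, and read() stands in for the blocked resolver call; it does not prove anything about gethostbyname itself.

```c
#define _POSIX_C_SOURCE 200809L
#include <assert.h>
#include <pthread.h>
#include <unistd.h>

static int g_cleanup_ran = 0; /* set by the cleanup handler on cancellation */

/* Cleanup handler: this is where a real thread would release its locks. */
static void on_cancel(void *arg)
{
    (void)arg;
    g_cleanup_ran = 1;
}

/* Blocks forever in read(), because the write end of the pipe stays silent. */
static void *blocked_reader(void *arg)
{
    int fd = *(int *)arg;
    char byte;
    ssize_t n;

    pthread_cleanup_push(on_cancel, NULL);
    n = read(fd, &byte, 1); /* cancellation point */
    (void)n;
    pthread_cleanup_pop(0); /* not reached when the thread is cancelled */
    return NULL;
}

/* Returns 1 if the blocked thread was cancelled and its cleanup handler
 * ran, 0 if the thread ended some other way, -1 on setup failure. */
int cancel_blocked_thread(void)
{
    int fds[2];
    pthread_t tid;
    void *result = NULL;

    if (pipe(fds) != 0)
        return -1;
    if (pthread_create(&tid, NULL, blocked_reader, &fds[0]) != 0) {
        close(fds[0]);
        close(fds[1]);
        return -1;
    }

    /* With deferred cancellation (the default), the request is acted on
     * when the thread reaches, or is already blocked in, read(). */
    pthread_cancel(tid);
    pthread_join(tid, &result);

    close(fds[0]);
    close(fds[1]);
    return result == PTHREAD_CANCELED && g_cleanup_ran;
}
```

If `cancel_blocked_thread()` hangs in pthread_join on a given system, that would match the LinuxThreads behaviour the man page text describes, where the blocked call is not treated as a cancellation point.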
I think we can avoid much of the thread complications by redesigning the DNS service to use multiple unjoinable threads. I don't have a complete design in my head at the moment, but after playing with similar issues in the disk cache, I can begin to see the light. Darin, do you want me to take this? I don't mind.
Can we mark this as a dupe? Is there any additional information needed here? I think it sounds like we have enough technical firepower to now agree this is a real problem, so I want to go hunting for all the unconfirmed dupes that have piled up over time. I am also likely to be increasingly behind on bugmail these days, so consolidation of bugs is a high priority for me.
Component: Networking: HTTP → Networking
QA Contact: httpqa → benc
Perhaps we could kill (9) the DNS thread if it still exists when we're finally ready to exit (perhaps after NSPR is otherwise fully shut down)?
Dependent on DNS servers, not a lot of users complaining, and any fix is likely to be a bit scary. At this point, we're not going to block on this.
Flags: blocking1.4? → blocking1.4-
>The situation is that we have called gethostbyname, which may block until the OS
>either gets the DNS result or determines that it cannot get the DNS result.

No, this is not the case in this bug. Mozilla does not exit after any timeout. It is still running in the background after 24 hours. The problem is that Mozilla's DNS gets into some corrupt state.
Vladislav: thanks for the info, but what version of mozilla are you testing?
Status: NEW → ASSIGNED
Target Milestone: --- → mozilla1.5alpha
This problem exists with all Mozilla versions I tried:
Mozilla/5.0 (X11; U; Linux i686; en-US; rv:1.3b) Gecko/20030210
Mozilla/5.0 (X11; U; Linux i686; en-US; rv:1.0.1) Gecko/20021003

The problem has existed for years, since Netscape. Note that to get Mozilla into this state you need to set up a DNS record such that the request is made in TCP mode. Then do File->Quit and then run
ps axuww|grep mozilla

See bug http://bugzilla.mozilla.org/show_bug.cgi?id=193827 which also includes packet traces that indicate DNS activity.
I've had this happen too, on RH9. Strace shows it waiting in futex_wait. The browser will just stop working, and then quitting doesn't actually quit, and restarts hang on the x-remote ping() from the shell script.
well, if the DNS thread ever were to deadlock, then on shutdown or restart we would indeed hang the entire browser when the UI thread joins with the DNS thread. so, sounds like we have a real race of some sort to unravel here. the DNS rewrite (bug 205726) should help since i think we can greatly simplify the thread synchronization.
Summary: Mozilla hangs during Quit while waiting for cranky DNS → DNS: hangs during Quit
ok, now that the patch for bug 205726 has landed, DNS pinning is now a thing of the past. this bug is fixed (note: on the trunk only).
Status: ASSIGNED → RESOLVED
Closed: 16 years ago
Resolution: --- → FIXED
V. No reports of this for some time.
Status: RESOLVED → VERIFIED