Created attachment 312610 [details] Report from Apple crash reporter From bug 337418: Comment #54 Florian Quèze 2008-03-28 10:49:15 PDT With my trunk build (which is already a bit old: Mozilla/5.0 (Macintosh; U; Intel Mac OS X 10.4; en-US; rv:1.9b4pre) Gecko/2008022621 Minefield/3.0b4pre) I sometimes get crashes with a stack like this one: Thread 1 Crashed: 0 <<00000000>> 0xffff07fc __memcpy + 92 (cpu_capabilities.h:228) 1 libnecko.dylib 0x143e08aa nsDNSRecord::GetNextAddr(unsigned short, PRNetAddr*) + 460 (nsDNSService2.cpp:143) 2 libnecko.dylib 0x143cf4be nsSocketTransport::RecoverFromError() + 426 (nsSocketTransport2.cpp:1242) 3 libnecko.dylib 0x143cf61d nsSocketTransport::OnSocketDetached(PRFileDesc*) + 173 (nsSocketTransport2.cpp:1559) ... I guess adding back NS_ENSURE_STATE(mHostRecord->addr) (proposed in comment 29) would fix it. Not sure if it's better to reopen this bug, or bug 290190 (where this was added) or file a new one. Comment #55 Wan-Teh Chang 2008-03-28 16:30:33 PDT Please open a new bug and cite this bug and bug 290190. Tell us in the new bug whether the URL has a host name or literal IP address.
Created attachment 312612 [details] [diff] [review] trivial patch Trivial patch replacing the assertion by NS_ENSURE_STATE like it was done before in bug 290190. (In reply to comment #55) > Tell us in the new bug whether the URL has a host name > or literal IP address. > How can I know this? I see this crash only once or twice a week so adding a printf somewhere would not be a quick way to know... When I have this crash, it's always just after a network connection drop (bad wifi at school) so I guess returning a failure for the DNS resolution would be OK.
We should find out why the nsHostRecord object doesn't have the required property: (http://lxr.mozilla.org/seamonkey/source/netwerk/dns/src/nsHostResolver.h#89) a fully resolved host record has either a non-null |addr_info| or |addr| field. By code inspection, I believe an nsHostRecord may get into the state where both the |addr_info| and |addr| fields are null if we call nsHostResolver::OnLookupComplete with |result| equal to null. This can happen after PR_GetAddrInfoByName fails: http://lxr.mozilla.org/seamonkey/source/netwerk/dns/src/nsHostResolver.cpp#701 695 ai = PR_GetAddrInfoByName(rec->host, rec->af, flags); ... 701 // convert error code to nsresult. 702 nsresult status = ai ? NS_OK : NS_ERROR_UNKNOWN_HOST; 703 resolver->OnLookupComplete(rec, status, ai); We should find out why an nsHostRecord in this state can still be passed to nsDNSRecord::GetNextAddr. If that is not a bug, we should update the comment quoted above to reflect this, and the right fix would be for nsDNSRecord::GetNextAddr to return NS_ERROR_NOT_AVAILABLE if both mHostRecord->addr_info and mHostRecord->addr are null. NS_ENSURE_TRUE causes a debug break in debug build (and returns NS_ERROR_UNEXPECTED), which isn't the right way to handle an expected error. On the other hand, if it is a bug for an nsHostRecord in that state to be passed to nsDNSRecord::GetNextAddr, we should fix the bug rather than just trying to not crash.
Christian, Boris, Darin, please see my findings in comment 2. If we fail to get the fresh addr info for a host: http://lxr.mozilla.org/seamonkey/source/netwerk/dns/src/nsHostResolver.cpp#701 695 ai = PR_GetAddrInfoByName(rec->host, rec->af, flags); ... 701 // convert error code to nsresult. 702 nsresult status = ai ? NS_OK : NS_ERROR_UNKNOWN_HOST; 703 resolver->OnLookupComplete(rec, status, ai); should resolver->OnLookupComplete keep the expired rec->addr_info or blow it away? Right now we blow it away, and this affects previously created nsDNSRecords that contain |rec|; their nsDNSRecord::GetNextAddr method will start to fail because they have no address to return now.
(In reply to comment #3) Don't know if the following is relevant, but just in case -- A few years ago (around Fx 0.10 or 1.0 IIRC, or was it NS6?) we kept DNS addresses for as long as the browser was online (i.e., hours or days), and that caused timeouts when the remote host changed its DNS record (e.g. because an alternate server or connection had to be brought up online because of some hardware failure). The workaround, which was completely unobvious for the newbie I was then confronted with a solid browser timeout at a given site, was to bring the browser off-line then back online. Then the bug was fixed (I don't remember the bug number) and henceforward DNS records were only cached for a few minutes. It was in 2002, see bug 151929 comment #2 (bug still open???) Ah, there, found it: bug 162871 fixed 2003-10-27
I'm not quite sure what the right thing to do here is, to be honest...
I keep hitting this too (on Linux), I really don't think we want to ship Firefox 3 with this crasher. We had one of these a while back as well, caused lots of crashes with shaky WiFi connections in cafes n' what not.
+'ing this as I agree with Johnny. Also, we need to make sure that we include tests with the patch as well. Reviewers should make sure they exist!
Created attachment 313738 [details] [diff] [review] patch v2 This patch is a minor improvement over Florian's patch (attachment 312612 [details] [diff] [review]). jst, could you please test it? Thanks.
Comment on attachment 313738 [details] [diff] [review] patch v2 r=jst, I've been running with this since friday and no crash yet. Given the nature of this I can't say for sure whether this completely fixes the problem, but it's looking good so far. Definitely worth getting in IMO.
Comment on attachment 313738 [details] [diff] [review] patch v2 Christian, could you superreview this patch. The change to nsDNSService2.cpp should be easy to understand. It looks like a standard crash-avoiding check, so I added a comment to explain how that condition can be true in practice. The change to nsHostResolver.cpp is not necessary for fixing this crash. It is intended to make the nsHostRecord expire immediately if the record contains no lookup results. (It doesn't make sense for an empty record to expire in the future.) Since our code doesn't check rec->expiration if |rec| is empty, this change is merely a code cleanup.
(In reply to comment #7) Damon, I'm sorry that I won't have time to write a test for this bug. It seems that mozilla/netwerk/test/TestDNS.cpp could be extended or a new test based on it could be created to test the fix for this bug. If anyone volunteers, I'll be happy to explain the bug to him.
10 years ago
I checked in the patch on the Mozilla trunk for Firefox 3 RC1. Checking in nsDNSService2.cpp; /cvsroot/mozilla/netwerk/dns/src/nsDNSService2.cpp,v <-- nsDNSService2.cpp new revision: 1.15; previous revision: 1.14 done Checking in nsHostResolver.cpp; /cvsroot/mozilla/netwerk/dns/src/nsHostResolver.cpp,v <-- nsHostResolver.cpp new revision: 1.23; previous revision: 1.22 done