Closed
Bug 337418
Opened 18 years ago
Closed 17 years ago
[@ PR_EnumerateHostEnt] flaky wifi connection or vpn or laptop sleeping
Categories
(Core :: Networking, defect, P4)
Core
Networking
Tracking
()
RESOLVED
FIXED
mozilla1.9beta3
People
(Reporter: timeless, Assigned: wtc)
References
Details
(Keywords: crash, regression)
Crash Data
Attachments
(6 files, 3 obsolete files)
2.06 KB,
patch
|
Details | Diff | Splinter Review | |
2.99 KB,
patch
|
Details | Diff | Splinter Review | |
217.01 KB,
application/octet-stream
|
Details | |
233.96 KB,
application/octet-stream
|
Details | |
160.00 KB,
application/octet-stream
|
Details | |
8.61 KB,
patch
|
Details | Diff | Splinter Review |
Incident ID: 18519692 Stack Signature PR_EnumerateHostEnt aad9b2b0 Product ID Firefox15 Build ID 2006050410 Trigger Time 2006-05-10 04:58:57.0 Platform Win32 Operating System Windows NT 5.1 build 2600 Module nspr4.dll + (0000cb31) URL visited no clue i was hopping from wifi network to wifi network and pac/no pac User Comments Since Last Crash 70200 sec Total Uptime 90970 sec Trigger Reason Access violation Source File, Line No. c:/builds/tinderbox/Fx-Mozilla1.8.0/WINNT_5.2_Depend/mozilla/nsprpub/pr/src/misc/prnetdb.c, line 1408 Stack Trace PR_EnumerateHostEnt [c:/builds/tinderbox/Fx-Mozilla1.8.0/WINNT_5.2_Depend/mozilla/nsprpub/pr/src/misc/prnetdb.c, line 1408] jay: could you please pull detailed stack information for this crash?
Assignee | ||
Comment 1•18 years ago
|
||
In Firefox 1.5, PR_EnumerateHostEnt is only used in LDAP and NSS (OCSP). The statement at prnetdb.c, line 1408 is 1408 memcpy(&address->inet.ip, addr, hostEnt->h_length);
firefox doesn't use ldap. and i'd be fairly surprised about oscp. why do you say it isn't used by PR_EnumeateAdrrInfo? that's used by netwerk/dns/src/nsDNSService2.cpp (nsDNSRecord::GetNextAddr) which indeed runs on threads and is a much more likely crash point.
Comment 4•18 years ago
|
||
not sure if it's related, but I had 2 crashes w/ similar stack signatures back in June: TB19985747 & TB19998195
finally hit this with both a firefox1.8.0.7 and mcsmurf build. here's the output from the mcsmurf build. it shows that i was correct in my analysis in comment 2. (ce0.a24): Break instruction exception - code 80000003 (first chance) eax=7ffde000 ebx=00000001 ecx=00000002 edx=00000003 esi=00000004 edi=00000005 eip=7c901230 esp=017bffcc ebp=017bfff4 iopl=0 nv up ei pl zr na po nc cs=001b ss=0023 ds=0023 es=0023 fs=0038 gs=0000 efl=00000246 ntdll!DbgBreakPoint: 7c901230 cc int 3 0:001> g (ce0.16ec): Access violation - code c0000005 (!!! second chance !!!) eax=125bf946 ebx=125bf942 ecx=00000001 edx=00000000 esi=125bf942 edi=0537c67c eip=78144e38 esp=019afdb4 ebp=019afdbc iopl=0 nv up ei ng nz ac po cy cs=001b ss=0023 ds=0023 es=0023 fs=003b gs=0000 efl=00000297 MSVCR80!UnwindUpVec+0x50: 78144e38 8b448efc mov eax,[esi+ecx*4-0x4] ds:0023:125bf942=???????? 0:001> !analyze -v -f ******************************************************************************* * * * Exception Analysis * * * ******************************************************************************* FAULTING_IP: MSVCR80!UnwindUpVec+50 [F:\RTM\vctools\crt_bld\SELF_X86\crt\src\intel\memcpy.asm @ 322] 78144e38 8b448efc mov eax,[esi+ecx*4-0x4] EXCEPTION_RECORD: ffffffff -- (.exr ffffffffffffffff) ExceptionAddress: 78144e38 (MSVCR80!UnwindUpVec+0x00000050) ExceptionCode: c0000005 (Access violation) ExceptionFlags: 00000000 NumberParameters: 2 Parameter[0]: 00000000 Parameter[1]: 125bf942 Attempt to read from address 125bf942 FAULTING_THREAD: 000016ec DEFAULT_BUCKET_ID: APPLICATION_FAULT PROCESS_NAME: firefox.exe ERROR_CODE: (NTSTATUS) 0xc0000005 - The instruction at "0x%08lx" referenced memory at "0x%08lx". The memory could not be "%s". READ_ADDRESS: 125bf942 BUGCHECK_STR: ACCESS_VIOLATION LAST_CONTROL_TRANSFER: from 3000d5d5 to 78144e38 STACK_TEXT: 019afdbc 3000d5d5 0537c67c 125bf942 00000004 MSVCR80!UnwindUpVec+0x50 [F:\RTM\vctools\crt_bld\SELF_X86\crt\src\intel\memcpy.asm @ 322] 019afddc 3000e408 00000002 02985408 000001bb nspr4!PR_EnumerateHostEnt+0x95 [h:\mozilla\tree- main\mozilla\nsprpub\pr\src\misc\prnetdb.c @ 1414] 019afdf0 0043031c 00000002 02985008 000001bb nspr4!PR_EnumerateAddrInfo+0x28 [h:\mozilla\tree- main\mozilla\nsprpub\pr\src\misc\prnetdb.c @ 2154] 019afe10 00459ed4 04585438 000001bb 0537c678 firefox!nsDNSRecord::GetNextAddr+0x30 [h:\mozilla\tree-main\mozilla\netwerk\dns\src\nsdnsservice2.cpp @ 116] 019afe34 0045a414 0154e0a0 0154e0c8 0537c620 firefox!nsSocketTransport::RecoverFromError+0x86 [h:\mozilla\tree-main\mozilla\netwerk\base\src\nssockettransport2.cpp @ 1218] 019afe48 00431498 036bad18 00000000 0154e0a0 firefox!nsSocketTransport::OnSocketDetached+0x3c [h:\mozilla\tree-main\mozilla\netwerk\base\src\nssockettransport2.cpp @ 1534] 019afe60 004318c5 0154e0c8 00000001 0154e0a0 firefox!nsSocketTransportService::DetachSocket+0x30 [h:\mozilla\tree-main\mozilla\netwerk\base\src\nssockettransportservice2.cpp @ 163] 019afe84 00431930 00000002 00000000 0154ebc8 firefox! nsSocketTransportService::DoPollIteration+0x22f [h:\mozilla\tree- main\mozilla\netwerk\base\src\nssockettransportservice2.cpp @ 641] 019afe98 0029e39d 0154e0a8 0154ebc8 00000001 firefox! nsSocketTransportService::OnProcessNextEvent+0x40 [h:\mozilla\tree- main\mozilla\netwerk\base\src\nssockettransportservice2.cpp @ 484] 019afec4 0027582f 0154ebc8 00000001 019afee0 xpcom_core!nsThread::ProcessNextEvent+0x53 [h:\mozilla\tree-main\mozilla\xpcom\threads\nsthread.cpp @ 473] 019afed8 004315d9 0154ebc8 00000001 0154e730 xpcom_core!NS_ProcessNextEvent_P+0x27 [h:\mozilla\tree-main\mozilla\firefox-build\xpcom\build\nsthreadutils.cpp @ 225] 019afefc 0029e3df 0154e0ac 002d133c 00000000 firefox!nsSocketTransportService::Run+0x86 [h:\mozilla\tree-main\mozilla\netwerk\base\src\nssockettransportservice2.cpp @ 525] 019aff1c 0027582f 0154ebc8 00000001 019aff38 xpcom_core!nsThread::ProcessNextEvent+0x95 [h:\mozilla\tree-main\mozilla\xpcom\threads\nsthread.cpp @ 483] 019aff30 0029e2e8 00000001 00000001 014139fc xpcom_core!NS_ProcessNextEvent_P+0x27 [h:\mozilla\tree-main\mozilla\firefox-build\xpcom\build\nsthreadutils.cpp @ 225] 019aff50 30015e7b 0154ebc8 01354410 01354410 xpcom_core!nsThread::ThreadFunc+0x5c [h:\mozilla\tree -main\mozilla\xpcom\threads\nsthread.cpp @ 251] 019aff64 7c910732 00000006 30017f2d 014d1e58 nspr4!_PR_NativeRunThread+0xdb [h:\mozilla\tree- main\mozilla\nsprpub\pr\src\threads\combined\pruthr.c @ 458] 019affac 78132a36 00000006 7c80b683 01537028 ntdll!RtlpAllocateFromHeapLookaside+0x42 019affbc 7c910732 00000006 01537028 7ffdb000 MSVCR80!_threadstartex+0x66 [f:\rtm\vctools\crt_bld\self_x86\crt\src\threadex.c @ 326] 019affec 00000000 781329d0 01537028 00000000 ntdll!RtlpAllocateFromHeapLookaside+0x42 FOLLOWUP_IP: MSVCR80!UnwindUpVec+50 [F:\RTM\vctools\crt_bld\SELF_X86\crt\src\intel\memcpy.asm @ 322] 78144e38 8b448efc mov eax,[esi+ecx*4-0x4] FAULTING_SOURCE_CODE: SYMBOL_STACK_INDEX: 0 FOLLOWUP_NAME: MachineOwner SYMBOL_NAME: MSVCR80!UnwindUpVec+50 MODULE_NAME: MSVCR80 IMAGE_NAME: MSVCR80.dll DEBUG_FLR_IMAGE_TIMESTAMP: 4333a455 STACK_COMMAND: ~1s ; kb FAILURE_BUCKET_ID: ACCESS_VIOLATION_MSVCR80!UnwindUpVec+50 BUCKET_ID: ACCESS_VIOLATION_MSVCR80!UnwindUpVec+50 Followup: MachineOwner ---------
for reference comment 5 details thread 1 (16ec), the following is thread 2: ntdll!KiFastSystemCallRet ntdll!ZwWaitForMultipleObjects+0xc kernel32!WaitForMultipleObjectsEx+0x12c kernel32!WaitForMultipleObjects+0x18 firefox!nsNotifyAddrListener::Run(void)+0x66 [h:\mozilla\tree-main\mozilla\netwerk\system\win32\nsnotifyaddrlistener.cpp @ 320] xpcom_core!nsThread::ProcessNextEvent(int mayWait = 1, int * result = 0x01abff38)+0x95 [h:\mozilla\tree-main\mozilla\xpcom\threads\nsthread.cpp @ 483] xpcom_core!NS_ProcessNextEvent_P(class nsIThread * thread = 0x00000001, int mayWait = 1)+0x27 [h:\mozilla\tree-main\mozilla\firefox-build\xpcom\build\nsthreadutils.cpp @ 225] xpcom_core!nsThread::ThreadFunc(void * arg = 0x015385d0)+0x5c [h:\mozilla\tree-main\mozilla\xpcom\threads\nsthread.cpp @ 251] nspr4!_PR_NativeRunThread(void * arg = 0x78132a36)+0xdb [h:\mozilla\tree-main\mozilla\nsprpub\pr\src\threads\combined\pruthr.c @ 458] nspr4!pr_root(void * arg = 0x78132a36)+0xd [h:\mozilla\tree-main\mozilla\nsprpub\pr\src\md\windows\w95thred.c @ 120] MSVCR80!_callthreadstartex(void)+0x1b [f:\rtm\vctools\crt_bld\self_x86\crt\src\threadex.c @ 348] MSVCR80!_threadstartex(void * ptd = 0x00000000)+0x66 [f:\rtm\vctools\crt_bld\self_x86\crt\src\threadex.c @ 326] kernel32!BaseThreadStart+0x37 and this is thread 1 again but with arguments: MSVCR80!UnwindUpVec+0x50 [F:\RTM\vctools\crt_bld\SELF_X86\crt\src\intel\memcpy.asm @ 322] nspr4!PR_EnumerateHostEnt(int enumIndex = 43536408, struct PRHostEnt * hostEnt = 0x00040002, unsigned short port = 0x501c, union PRNetAddr * address = 0x00000000)+0x95 [h:\mozilla\tree-main\mozilla\nsprpub\pr\src\misc\prnetdb.c @ 1414] nspr4!PR_EnumerateAddrInfo(void * iterPtr = 0x00459ed4, struct PRAddrInfo * base = 0x04585438, unsigned short port = 0x1bb, union PRNetAddr * result = 0x0537c678)+0x28 [h:\mozilla\tree-main\mozilla\nsprpub\pr\src\misc\prnetdb.c @ 2154] firefox!nsDNSRecord::GetNextAddr(unsigned short port = 0x1bb, union PRNetAddr * addr = 0x0537c678)+0x30 [h:\mozilla\tree-main\mozilla\netwerk\dns\src\nsdnsservice2.cpp @ 116] firefox!nsSocketTransport::RecoverFromError(void)+0x86 [h:\mozilla\tree-main\mozilla\netwerk\base\src\nssockettransport2.cpp @ 1218] firefox!nsSocketTransport::OnSocketDetached(struct PRFileDesc * fd = 0x036bad18)+0x3c [h:\mozilla\tree-main\mozilla\netwerk\base\src\nssockettransport2.cpp @ 1534] firefox!nsSocketTransportService::DetachSocket(struct nsSocketTransportService::SocketContext * sock = 0x0154e0c8)+0x30 [h:\mozilla\tree-main\mozilla\netwerk\base\src\nssockettransportservice2.cpp @ 163] firefox!nsSocketTransportService::DoPollIteration(int wait = 2)+0x22f [h:\mozilla\tree-main\mozilla\netwerk\base\src\nssockettransportservice2.cpp @ 641] firefox!nsSocketTransportService::OnProcessNextEvent(class nsIThreadInternal * thread = 0x0154ebc8, int mayWait = 1, unsigned int depth = 0)+0x40 [h:\mozilla\tree-main\mozilla\netwerk\base\src\nssockettransportservice2.cpp @ 484] xpcom_core!nsThread::ProcessNextEvent(int mayWait = 1, int * result = 0x019afee0)+0x53 [h:\mozilla\tree-main\mozilla\xpcom\threads\nsthread.cpp @ 473] xpcom_core!NS_ProcessNextEvent_P(class nsIThread * thread = 0x0154ebc8, int mayWait = 1)+0x27 [h:\mozilla\tree-main\mozilla\firefox-build\xpcom\build\nsthreadutils.cpp @ 225] firefox!nsSocketTransportService::Run(void)+0x86 [h:\mozilla\tree-main\mozilla\netwerk\base\src\nssockettransportservice2.cpp @ 525] xpcom_core!nsThread::ProcessNextEvent(int mayWait = 1, int * result = 0x019aff38)+0x95 [h:\mozilla\tree-main\mozilla\xpcom\threads\nsthread.cpp @ 483] xpcom_core!NS_ProcessNextEvent_P(class nsIThread * thread = 0x00000001, int mayWait = 1)+0x27 [h:\mozilla\tree-main\mozilla\firefox-build\xpcom\build\nsthreadutils.cpp @ 225] xpcom_core!nsThread::ThreadFunc(void * arg = 0x0154ebc8)+0x5c [h:\mozilla\tree-main\mozilla\xpcom\threads\nsthread.cpp @ 251] nspr4!_PR_NativeRunThread(void * arg = 0x78132a36)+0xdb [h:\mozilla\tree-main\mozilla\nsprpub\pr\src\threads\combined\pruthr.c @ 458] ntdll!RtlpAllocateFromHeapLookaside+0x42 MSVCR80!_threadstartex(void * ptd = 0x00000000)+0x66 [f:\rtm\vctools\crt_bld\self_x86\crt\src\threadex.c @ 326] ntdll!RtlpAllocateFromHeapLookaside+0x42
Assignee | ||
Comment 7•18 years ago
|
||
timeless, you're right. Firefox calls PR_EnumerateAddrInfo, which calls PR_EnumerateHostEnt. As far as I can tell, it seems that we crash in memcpy because the second argument 125bf942 (the source address for memcpy) is a bad address. PR_EnumerateHostEnt has two memcpy calls. We are crashing in the second memcpy call at line 1414 for the IPv4 case: 1412 PR_ASSERT(AF_INET == hostEnt->h_addrtype); 1413 address->inet.port = htons(port); 1414 memcpy(&address->inet.ip, addr, hostEnt->h_length); As I said above, I believe 'addr' is a bad address. I suggest that you or Darin artificially cause the following code path to be executed in a debugger and see what happens: nspr4!PR_EnumerateHostEnt nspr4!PR_EnumerateAddrInfo firefox!nsDNSRecord::GetNextAddr firefox!nsSocketTransport::RecoverFromError firefox!nsSocketTransport::OnSocketDetached firefox!nsSocketTransportService::DetachSocket firefox!nsSocketTransportService::DoPollIteration firefox!nsSocketTransportService::OnProcessNextEvent
Assignee | ||
Comment 8•18 years ago
|
||
timeless, your thread 1 stack with arguments in comment 6 seems strange in these three stack frames at the bottom: nspr4!PR_EnumerateHostEnt(int enumIndex = 43536408, struct PRHostEnt * hostEnt = 0x00040002, unsigned short port = 0x501c, union PRNetAddr * address = 0x00000000)+0x95 [h:\mozilla\tree-main\mozilla\nsprpub\pr\src\misc\prnetdb.c @ 1414] nspr4!PR_EnumerateAddrInfo(void * iterPtr = 0x00459ed4, struct PRAddrInfo * base = 0x04585438, unsigned short port = 0x1bb, union PRNetAddr * result = 0x0537c678)+0x28 [h:\mozilla\tree-main\mozilla\nsprpub\pr\src\misc\prnetdb.c @ 2154] firefox!nsDNSRecord::GetNextAddr(unsigned short port = 0x1bb, union PRNetAddr * addr = 0x0537c678)+0x30 [h:\mozilla\tree-main\mozilla\netwerk\dns\src\nsdnsservice2.cpp @ 116] The 'unsigned short port' argument and the 'union PRNetAddr *' argument change values when they reach the PR_EnumerateHostEnt function. This doesn't match the PR_EnumerateAddrInfo source code, which passes those two values unchanged to PR_EnumerateHostEnt. If the 'union PRNetAddr * address' argument to PR_EnumerateHostEnt really is 0x00000000, we would have crashed earlier at the memset call at line 1398: 1398 memset(address, 0, sizeof(PRNetAddr));
I'm aware of that. I'm at home, I'll be at work shortly. I had the disassembly for that frame in notepad, I'll attach it if I remember. please keep in mind that windbg shows the value of the variable which is aliased as the argument at that frame, if some optimization results in the argument's slot being used for some other purpose, then *poof*, you'll see something strange. you should always assume that the correct argument was passed from the calling frame and that the variable has since been modified and what's being shown is either the local value of the argument or some other variable that shared the register.
Comment 10•18 years ago
|
||
I'm seeing this too, in our 3rd party product, using MOZILLA_1_8_BRANCH: Crashing instruction: memcpy(&address->inet.ip, addr, hostEnt->h_length); Crashing thread 14 00 nspr4!PR_EnumerateHostEnt(int enumIndex = <Memory access error>, struct PRHostEnt * hostEnt = <Memory access error>, unsigned short port = <Memory access error>, union PRNetAddr * address = <Memory access error>)+0x80 01 nspr4!PR_EnumerateAddrInfo(void * iterPtr = 0x00c47a3f, struct PRAddrInfo * base = 0x0581f2d8, unsigned short port = 0x50, union PRNetAddr * result = 0x057606a8)+0x28 02 necko!nsDNSRecord::GetNextAddr(unsigned short port = 0x50, union PRNetAddr * addr = 0x057606a8)+0x29 03 necko!nsSocketTransport::RecoverFromError(void)+0x8a 04 necko!nsSocketTransport::OnSocketDetached(struct PRFileDesc * fd = 0x05824e70)+0x3d 05 necko!nsSocketTransportService::DetachSocket(struct nsSocketTransportService::SocketContext * sock = 0x008ebdc6)+0x31 06 necko!nsSocketTransportService::Run(void)+0x238 07 xpcom_core!nsThread::Main(void * arg = 0x7c80b683)+0x1e 08 nspr4!_PR_NativeRunThread(void * arg = 0x7c80b683)+0xd1 09 js3250!JS_DHashTableOperate(struct JSDHashTable * table = 0x00000004, void * key = 0x05760650, JSDHashOperator op = 11569720 (No matching enumerant))+0xe1 0a kernel32!BaseThreadStart+0x37
Comment 11•18 years ago
|
||
Is this a dup of bug 290190?
Updated•18 years ago
|
QA Contact: wtchang → nspr
Comment 12•17 years ago
|
||
I got this crash also on: Mozilla/5.0 (Macintosh; U; Intel Mac OS X; es-ES; rv:1.9a9pre) Gecko/2007101904 Minefield/3.0a9pre Just surfing around on gmail, and my home network went down. The crash happened afterwards. Crash reporter: http://crash-stats.mozilla.com/report/index/8eaf80c0-7f28-11dc-97b2-001a4bd46e84?date=2007-10-20-16
Comment 14•17 years ago
|
||
(In reply to comment #13) > *** Bug 401500 has been marked as a duplicate of this bug. *** > Bug 401500 was a Linux crash, not WinXP. Comment #12 is for an Intel Mac. Please set this bug's OS to ALL.
Comment 15•17 years ago
|
||
Mozilla/5.0 (Macintosh; U; Intel Mac OS X; es-ES; rv:1.9a9pre) Gecko/2007102904 Minefield/3.0a9pre I repro'd this again on intel mac today. This is becoming a common crash for me, as my mbp has a flaky wireless card to begin with. My common steps is my Airport wifi will drop connection, and in the middle of surfing on the browser, it will crash and generate this stack: http://crash-stats.mozilla.com/report/index/9cc4d147-8657-11dc-bb04-001a4bd43e5c?date=2007-10-29-19 This has quietly made it up to #13 on Top Crashes. Recommending blocking1.9 flag for M9.
Flags: blocking1.9?
Assignee | ||
Comment 16•17 years ago
|
||
This bug should be investigated as a Core:Networking bug. Tony (Chung), if you can reproduce this crash easily, could you cause a debug build to crash in the debugger and find out which of the three pointer dereferences on this line is the culprit? prnetdb.c:2098 result->raw.family = ai->ai_addr->sa_family; In other words, please find out whether 'result', 'ai', or 'ai->ai_addr' is a bad pointer when Firefox crashes. If you don't have the setup to do that, I will check in a patch to do one pointer dereferencing per line and ask you to crash Firefox and submit a Talkback report.
Comment 17•17 years ago
|
||
I dont have a debug environment running here. Since trunk is using breakpad, i wont be able to run talkback since i'm testing on trunk nightlies. Are these stack traces not enough information? I can try arranging to use the apple crash reporter as well if you need more data. Otherwise, i'd be happy to continue reporting these crashes if it still repros with your patch.
Assignee | ||
Comment 18•17 years ago
|
||
Tony, I meant breakpad crash reports. Sorry. I looked at the source code for an hour. The only problem I found is that in a nsDNSRecord, mIter and mHostRecord->addr_info will get out of sync if nsHostResolver::OnLookupComplete is called on the nsHostRecord that mHostRecord points to: nsHostResolver.cpp: 602 void 603 nsHostResolver::OnLookupComplete(nsHostRecord *rec, nsresult status, PRAddrInfo *result) 604 { ... 615 // update record fields. We might have a rec->addr_info already if a 616 // previous lookup result expired and we're reresolving it.. 617 if (rec->addr_info) 618 PR_FreeAddrInfo(rec->addr_info); 619 rec->addr_info = result; When this happens, we need to invalidate the mIter fields of all nsDNSRecord objects whose mHostRecord points to rec. But I haven't figured out how to do that. Also, after invalidating mIter, I am not sure if rewinding mIter (resetting it to nsnull) is the right action because it'll likely cause us to retry an address that we used before. I have a hunch that I'm on the right track, and hope that someone more familiar with the networking code can figure out what's going on from this description.
Comment 19•17 years ago
|
||
(In reply to comment #18) [...] > I looked at the source code for an hour. The only problem I found > is that in a nsDNSRecord, mIter and mHostRecord->addr_info will get > out of sync if nsHostResolver::OnLookupComplete is called on > the nsHostRecord that mHostRecord points to: [...] Huh-uh. Two days ago I had a similar crash (see bug 401500, duped to this one), for the first time in my life, on a desktop computer with a permanent "copper" connection and in a city with a good power supply, but at a time when up to 5 tabs, including 3 for tinderbox.mozilla.org, might have decided to auto-refresh. A race condition such as what you're describing sounds like a plausible explanation (to me, who don't know Mozilla code).
Assignee: wtc → nobody
Component: NSPR → Networking
Product: NSPR → Core
QA Contact: nspr → networking
Version: other → Trunk
Assignee | ||
Comment 20•17 years ago
|
||
I can't figure out a way to cause Firefox to crash there. If my theory is correct, the crash is caused by reading from memory that has been freed. But freed memory reads don't always result in a crash. :-(
Comment 21•17 years ago
|
||
Got another crash today. Same thing, was surfing around and minding my own business, and my wifi on my router dropped connection. Crash happens shortly after Firefox tried to connect to the server. It doesnt happen everytime my wifi drops, but its the third time i've already seen this crash report. Running Mozilla/5.0 (Macintosh; U; Intel Mac OS X 10.4; es-ES; rv:1.9a9pre) Gecko/2007110104 Minefield/3.0a9pre. Breakpad Report: http://crash-stats.mozilla.com/report/index/3449ede1-8893-11dc-af4e-001a4bd46e84?date=2007-11-01-15
Assignee | ||
Comment 23•17 years ago
|
||
The crash reports in this bug and the crash stats provided by Tony Mechelynck in bug 401500 comment 1 are strong support for my theory that it is mIter that is bad. All the crashes occured on IPv6-enabled systems (which is why most of them are Mac OS X and Linux) where mIter is a pointer to a node in a linked list. If the linked list changes, the pointer becomes invalid. On other systems, mIter is an integer index into an array. If the array changes but the index is still within the bounds of the new array (which is likely), the index is still "valid", hence no crash. I will try to spend some time this weekend putting together a defensive patch.
Assignee | ||
Comment 24•17 years ago
|
||
This patch makes PR_EnumerateHostEnt and PR_EnumerateAddrInfo validate their enumerator/iterator argument. We should use this patch only as a last resort because it punishes all callers of these functions. On the other hand, the overhead of the index/pointer validation should be low for most hosts because most hosts have only a small number of addresses.
Assignee | ||
Comment 25•17 years ago
|
||
Tony Chung, and anyone who encounters this crash on Mac OS X Intel: Please replace the libnspr4.dylib in your Firefox 3 trunk build with this libnspr4.dylib. Let me know if this makes the crash go away. If it does, this will be another strong support for my theory.
Comment 26•17 years ago
|
||
Hi Wan-teh, Thanks for taking the time to investigate and work on this. i will try out .dylib file locally against my nightly builds, and monitor it for a few days to see if it still crashes. Will let you know if i see crashes.
Comment 27•17 years ago
|
||
(In reply to comment #24) > Created an attachment (id=287197) > Last resort: PR_EnumerateAddrInfo validates its iterPtr argument > > This patch makes PR_EnumerateHostEnt and PR_EnumerateAddrInfo > validate their enumerator/iterator argument. We should use this > patch only as a last resort because it punishes all callers of > these functions. On the other hand, the overhead of the index/pointer > validation should be low for most hosts because most hosts have only > a small number of addresses. > Hm, I see. Not an elegant solution, as you said; but I guess that, unless you can come up with something better, a small slowdown is better than a big crash. ;-) -- a couple of "style nits": - shouldn't the condition in the "for" statement be *ap != NULL rather than != 0 ? (this, I think, doesn't change the logic, only makes it clearer) - more important: if na is the _number_ of entries, isn't there an off-by-one error in the validation of enumIndex? (< 0 or >= na)
Assignee | ||
Comment 28•17 years ago
|
||
Comment on attachment 287197 [details] [diff] [review] Last resort: PR_EnumerateAddrInfo validates its iterPtr argument Tony M.: thanks for the review comments. Re: style: this for loop was copied from elsewhere in the same file. I'm following the coding style of the original author. It's more important to use a consistent style than my favorite style. Re: off-by-one error: it is better to think of na as the index of the last array element than the number of elements. Then it's clear that a valid enumIndex should satisfy 0 <= enumIndex <= na. Why is na the index of the last array element? hostEnt->h_addr_list is an array of (char *) pointers, with the last array element being NULL. In the for loop, we move ap and na in locked step. So when *ap is NULL (the last array element), na is the index of that element.
Assignee | ||
Comment 29•17 years ago
|
||
[Note: I can build nspr4.dll for Windows or libnspr4.so for Linux if you'd like to test my "last resort" NSPR patch. Just send me email.] This patch uses a generation count to detect that mHostRecord->addr_info has changed since we last used mIter. Given my limited knowledge of NetLib, this is the best solution I can come up with. I have three questions. 1. I backed out Darin's "bandaid" patch (attachment 199372 [details] [diff] [review]) from bug 290190. Should I leave the NS_ENSURE_STATE(mHostRecord->addr) there and just remove the comment "but see bug 290190"? 2. When we detect that mHostRecord->addr_info has changed, should we restart the iteration or just fail? 3. Does nsHostRecord need to be protected with a lock? I know it is used by at least two threads (the nsHostResolver worker thread and the thread that calls nsSocketTransportService::DoPollIteration), but I don't know whether these two threads may access nsHostRecord simultaneously or nsHostRecord is always handed off from one thread to another.
Comment 30•17 years ago
|
||
(In reply to comment #28) > (From update of attachment 287197 [details] [diff] [review]) > Tony M.: thanks for the review comments. > > Re: style: this for loop was copied from elsewhere in the same file. > I'm following the coding style of the original author. It's more > important to use a consistent style than my favorite style. Well... OK > > Re: off-by-one error: it is better to think of na as the index of > the last array element than the number of elements. Then it's > clear that a valid enumIndex should satisfy 0 <= enumIndex <= na. > > Why is na the index of the last array element? hostEnt->h_addr_list > is an array of (char *) pointers, with the last array element being > NULL. In the for loop, we move ap and na in locked step. So when > *ap is NULL (the last array element), na is the index of that element. > The terminating NULL is a valid array element, is that it? Array bound checking has always been one of my worst nightmares. (In reply to comment #29) [...] > [Note: I can build nspr4.dll for Windows or libnspr4.so for Linux if > you'd like to test my "last resort" NSPR patch. Just send me email.] [...] Good to know. I won't do it yet, because I've only had the crash once, and it would mean overwriting /usr/local/seamonkey/libnspr4.so again after each nightly install, but if I start crashing this way repeatedly I'll write -- unless you upload it somewhere first. :-) :-) :-)
Assignee | ||
Comment 31•17 years ago
|
||
Yes, the terminating NULL is the last valid array element.
Assignee | ||
Comment 32•17 years ago
|
||
Whenever an invalid pointer or array index is detected, we write a line to the log file /tmp/wtc-getnextaddr.log.
Assignee | ||
Comment 33•17 years ago
|
||
Tony C.: please use this libnspr4.dylib instead. It writes a line to the log file /tmp/wtc-getnextaddr.log whenever it detects a bad pointer. Please run the Firefox 3 trunk build with this libnspr4.dylib and report back in a week or two whether the log file /tmp/wtc-getnextaddr.log is created.
Attachment #287198 -
Attachment is obsolete: true
Updated•17 years ago
|
Flags: blocking1.9? → blocking1.9+
Reporter | ||
Comment 34•17 years ago
|
||
wtc: fwiw my wXP laptop does not seem to have ipv6 enabled
Comment 35•17 years ago
|
||
wtc: just fyi i replaced did libnspr4.dylib file for a day so far, and no crash to reproduce yet. Will keep trying. Also, i can try running windows side by side too, so if you have the .dll handy, just attach it here and i'll do the same. question, if i update to nightly builds, i'll need to manually replace the .dylib file each time afterwards, correct?
Assignee | ||
Comment 36•17 years ago
|
||
Tony C: yes, whenever you update to a new nightly build, you need to manually replace the libnspr4.dylib file. Please also check /tmp/wtc-getnextaddr.log regularly, and report back as soon as the log file is created. Thanks. Here is the libnspr4.so for Linux x86. I will provide the Windows DLL later today.
Assignee | ||
Comment 37•17 years ago
|
||
Updated•17 years ago
|
Priority: -- → P1
Comment 38•17 years ago
|
||
So this is currently identified as a P1 blocker, which means that it needs to be fixed by 1.9b2, in about a month. wtc: it looks like you're making progress here, can you take this bug?
Assignee | ||
Comment 39•17 years ago
|
||
I'm working on this bug in my spare time (weekends). I can't reproduce the crash, which makes debugging very difficult. All I have now is a guess of the cause of the crash. Unless Tony C. reports back that a /tmp/wtc-getnextaddr.log file is created on his system by my libnspr4.dylib file, I am blocked. I'd also need the real Mozilla developers to answer my questions in comment 29. Thanks!
Status: NEW → ASSIGNED
Comment 40•17 years ago
|
||
ive ran the nightlies this week on mac and the attachment wtc put in, but still no luck yet on a crash. I'll be running it this week against the beta 1 candidate. Will let you know once i get a repro.
Updated•17 years ago
|
Assignee: nobody → wtc
Status: ASSIGNED → NEW
Comment 41•17 years ago
|
||
wtc: it seems like it would be better to deep copy the addrinfo from nspr somehow. e.g., perform the complete enumeration and save the results. then nsDNSRecord can just serve the copy of the results. that would simplify all of the issues related to multithreading.
Comment 42•17 years ago
|
||
This is not a blocker, it's not a regression. We'll take a patch if you have one.
Flags: blocking1.9+ → blocking1.9-
Priority: P1 → P4
Assignee | ||
Comment 43•17 years ago
|
||
Darin, thank you for the comment. Saving a deep copy of the addrinfo from NSPR in nsDNSRecord will consume more memory (popular websites have multiple IP addresses) and defeat the expiration of the addrinfo in nsHostRecord. Each PRNetAddr is ~ 108 bytes. We can limit the memory overhead by copying only two or three IP addresses, which should be sufficient in practice, but I'm worried about not doing expiration. On the other hand, serving the copy of the results will avoid the awkward issue of restarting the iteration when the addrinfo in nsHostRecord has changed. Do you think only saving 2 or 3 addresses and not doing expiration is OK? I decided to finish my patch by adding a lock to each nsHostRecord. I was worried about the memory overhead of one PRLock per nsHostRecord, but it cannot be worse than the deep copying approach. Questions: 1. In nsDNSRecord::GetCanonicalName, does result.Assign(cname) copy the |cname| string into |result|? I browsed around in LXR but couldn't find the implementation of nsACString::Assign(). 2. What's the name of the thread that calls nsDNSRecord::GetCanonicalName and nsDNSRecord::GetNextAddr? In this comment I call it Thread X. 3. I studied the source code and found that nsHostRecord is used by the nsHostResolver worker thread and Thread X, at least. Thread X only accesses the following fields: - flags - host - addr_info - addr The |flags|, |host|, and |addr| fields do not change after they have been assigned values. This is why |addr_info_lock| only needs to protect |addr_info| and the |addr_info_gencnt| that I added. In addition, Thread X only reads nsHostRecord. Only the nsHostResolver worker thread modifies nsHostRecord (and only in nsHostResolver::OnLookupComplete). This is relevant because the worker thread doesn't need to lock when reading the addrinfo. Please verify the above description of thread access to nsHostRecord, because it determines the correctness of my use of locking. I will add a concise summary as a comment to nsHostResolver.h. 4. I use PR_Lock/PR_Unlock so I can hold the lock for the shortest time. If you'd prefer that I use nsAutoLock, I can do that.
Attachment #287199 -
Attachment is obsolete: true
Attachment #288176 -
Flags: superreview?(darin.moz)
Attachment #288176 -
Flags: review?(cbiesinger)
Updated•17 years ago
|
Attachment #288176 -
Attachment is patch: true
Attachment #288176 -
Attachment mime type: application/octet-stream → text/plain
Comment 44•17 years ago
|
||
jst and I just spent a while coming up with basically the conclusions in this bug, with the current crash in fact being a regression from the patch in bug 397716. Before that patch, we would just leak the old addr info in the situation described in comment 18. Now we free a data structure that someone might still be referencing, leading to a crash. Renominating based on that, since at this point that crash seems to be what this bug is about. There's quite a number of breakpad reports for this. My initial instinct for fixing this was to leave existing consumers who are holding long-term references to nsHostRecords with stale information and just create a new nsHostRecord if we have to resolve and we're not the only ones with a ref to the nsHostRecord. If people feel that this is a better solution, I can attach the patch I have, but personally I like wtc's approach more.
Assignee | ||
Comment 45•17 years ago
|
||
Boris, your approach seems similar to what Darin proposed in comment 41. The advantage of your approaches is that we don't need to deal with the issue of how to restart a PRAddrInfo enumeration with a new nsHostRecord.
Comment 46•17 years ago
|
||
Right. And we'd only copy if the host record expired and someone was still holding on to it. I guess it's a matter of what contract the DNS service exposes...
Comment 48•17 years ago
|
||
wtc, just wanted to let u know i still run the .dylib file against my nightly builds. i just have not found a repro crash since my last comment yet, so still investigating.
Comment 49•17 years ago
|
||
(In reply to comment #43) > 1. In nsDNSRecord::GetCanonicalName, does result.Assign(cname) > copy the |cname| string into |result|? I browsed around in > LXR but couldn't find the implementation of nsACString::Assign(). The implementation is at http://lxr.mozilla.org/seamonkey/source/xpcom/string/src/nsTAString.cpp#227, usually leading to http://lxr.mozilla.org/seamonkey/source/xpcom/string/src/nsTSubstring.cpp#374 It just adds a reference to the buffer when possible; otherwise copies (strings are copy-on-write) > 2. What's the name of the thread that calls nsDNSRecord::GetCanonicalName > and nsDNSRecord::GetNextAddr? In this comment I call it Thread X. GetNextAddr is often called by the socket transport thread, but it can be called by PAC and other code on the main thread as well. > Please verify the above description of thread access to > nsHostRecord, because it determines the correctness of > my use of locking. I will add a concise summary as a > comment to nsHostResolver.h. The description is correct. > 4. I use PR_Lock/PR_Unlock so I can hold the lock for > the shortest time. If you'd prefer that I use nsAutoLock, > I can do that. I prefer that. You can use something like: { nsAutoLock autolock(foo); // ... } to limit the amount of time you hold the lock.
Comment 50•17 years ago
|
||
Ah, I see. I wrote that last comment before reading the patch. nsAutoLock wouldn't really work for the nsDNSRecord::GetNextAddr; feel free to keep the PR_Lock/PR_Unlock there.
Updated•17 years ago
|
Attachment #288176 -
Flags: review?(cbiesinger) → review+
Comment 51•17 years ago
|
||
Comment on attachment 288176 [details] [diff] [review] Proposed patch Boris, any chance you could sr this? If we get this reviewed real soon, we could try to get this in for beta2. Darin, if you're listening and you have the cycles, please don't let me stop you from sr:ing here :)
Attachment #288176 -
Flags: superreview?(darin.moz) → superreview?(bzbarsky)
Comment 52•17 years ago
|
||
Comment on attachment 288176 [details] [diff] [review] Proposed patch >Index: netwerk/dns/src/nsDNSService2.cpp I don't see anything preventing Rewind() getting called while we're in the middle of GetNextAddr() (from different threads). Perhaps there should be? File a followup bug, since that's been a problem all along? Other than that, looks great. sr=bzbarsky.
Attachment #288176 -
Flags: superreview?(bzbarsky) → superreview+
Assignee | ||
Comment 53•17 years ago
|
||
Boris: good question. I'm assuming that GetNextAddr() and Rewind() are called from only one thread, and therefore it's not necessary to use a lock to protect access to mIter and mDone. Also, this LXR query seems to show that Rewind() is defined but not used: http://lxr.mozilla.org/seamonkey/ident?i=Rewind I checked in this patch, after updating the comments in nsHostResolver.h. Checking in nsDNSService2.cpp; /cvsroot/mozilla/netwerk/dns/src/nsDNSService2.cpp,v <-- nsDNSService2.cpp new revision: 1.14; previous revision: 1.13 done Checking in nsHostResolver.cpp; /cvsroot/mozilla/netwerk/dns/src/nsHostResolver.cpp,v <-- nsHostResolver.cpp new revision: 1.21; previous revision: 1.20 done Checking in nsHostResolver.h; /cvsroot/mozilla/netwerk/dns/src/nsHostResolver.h,v <-- nsHostResolver.h new revision: 1.9; previous revision: 1.8 done
Attachment #288176 -
Attachment is obsolete: true
Assignee | ||
Updated•17 years ago
|
Status: NEW → RESOLVED
Closed: 17 years ago
Resolution: --- → FIXED
Target Milestone: --- → mozilla1.9 M10
Updated•17 years ago
|
Target Milestone: mozilla1.9 M10 → mozilla1.9 M11
Comment 54•16 years ago
|
||
With my trunk build (which is already a bit old: Mozilla/5.0 (Macintosh; U; Intel Mac OS X 10.4; en-US; rv:1.9b4pre) Gecko/2008022621 Minefield/3.0b4pre) I sometimes get crashes with a stack like this one: Thread 1 Crashed: 0 <<00000000>> 0xffff07fc __memcpy + 92 (cpu_capabilities.h:228) 1 libnecko.dylib 0x143e08aa nsDNSRecord::GetNextAddr(unsigned short, PRNetAddr*) + 460 (nsDNSService2.cpp:143) 2 libnecko.dylib 0x143cf4be nsSocketTransport::RecoverFromError() + 426 (nsSocketTransport2.cpp:1242) 3 libnecko.dylib 0x143cf61d nsSocketTransport::OnSocketDetached(PRFileDesc*) + 173 (nsSocketTransport2.cpp:1559) ... I guess adding back NS_ENSURE_STATE(mHostRecord->addr) (proposed in comment 29) would fix it. Not sure if it's better to reopen this bug, or bug 290190 (where this was added) or file a new one.
Assignee | ||
Comment 55•16 years ago
|
||
Please open a new bug and cite this bug and bug 290190. Tell us in the new bug whether the URL has a host name or literal IP address.
Comment 56•15 years ago
|
||
Seems this problem is not really solved. These crashes apparently persist. See new Bug 501446 - Crash [@ PR_EnumerateAddrInfo ] at random times
Updated•13 years ago
|
Crash Signature: [@ PR_EnumerateHostEnt]
You need to log in
before you can comment on or make changes to this bug.
Description
•