Closed Bug 337418 Opened 15 years ago Closed 14 years ago

[@ PR_EnumerateHostEnt] flaky wifi connection or vpn or laptop sleeping

Categories

(Core :: Networking, defect, P4)

defect

Tracking

()

RESOLVED FIXED
mozilla1.9beta3

People

(Reporter: timeless, Assigned: wtc)

References

Details

(Keywords: crash, regression)

Crash Data

Attachments

(6 files, 3 obsolete files)

Incident ID: 18519692
Stack Signature	PR_EnumerateHostEnt aad9b2b0
Product ID	Firefox15
Build ID	2006050410
Trigger Time	2006-05-10 04:58:57.0
Platform	Win32
Operating System	Windows NT 5.1 build 2600
Module	nspr4.dll + (0000cb31)
URL visited	no clue i was hopping from wifi network to wifi network and pac/no pac
User Comments	
Since Last Crash	70200 sec
Total Uptime	90970 sec
Trigger Reason	Access violation
Source File, Line No.	c:/builds/tinderbox/Fx-Mozilla1.8.0/WINNT_5.2_Depend/mozilla/nsprpub/pr/src/misc/prnetdb.c, line 1408
Stack Trace 	
PR_EnumerateHostEnt  [c:/builds/tinderbox/Fx-Mozilla1.8.0/WINNT_5.2_Depend/mozilla/nsprpub/pr/src/misc/prnetdb.c, line 1408]

jay: could you please pull detailed stack information for this crash?
In Firefox 1.5, PR_EnumerateHostEnt is only used in LDAP
and NSS (OCSP).  The statement at prnetdb.c, line 1408 is

1408             memcpy(&address->inet.ip, addr, hostEnt->h_length);
firefox doesn't use ldap. and i'd be fairly surprised about oscp. 

why do you say it isn't used by PR_EnumeateAdrrInfo? that's used by netwerk/dns/src/nsDNSService2.cpp (nsDNSRecord::GetNextAddr) which indeed runs on threads and is a much more likely crash point.
this is getting *very* annoying.
not sure if it's related, but I had 2 crashes w/ similar stack signatures back in June:

TB19985747 & TB19998195
finally hit this with both a firefox1.8.0.7 and mcsmurf build. here's the output from the mcsmurf build. it shows that i was correct in my analysis in comment 2.

(ce0.a24): Break instruction exception - code 80000003 (first chance)
eax=7ffde000 ebx=00000001 ecx=00000002 edx=00000003 esi=00000004 edi=00000005
eip=7c901230 esp=017bffcc ebp=017bfff4 iopl=0         nv up ei pl zr na po nc
cs=001b  ss=0023  ds=0023  es=0023  fs=0038  gs=0000             efl=00000246
ntdll!DbgBreakPoint:
7c901230 cc               int     3
0:001> g
(ce0.16ec): Access violation - code c0000005 (!!! second chance !!!)
eax=125bf946 ebx=125bf942 ecx=00000001 edx=00000000 esi=125bf942 edi=0537c67c
eip=78144e38 esp=019afdb4 ebp=019afdbc iopl=0         nv up ei ng nz ac po cy
cs=001b  ss=0023  ds=0023  es=0023  fs=003b  gs=0000             efl=00000297
MSVCR80!UnwindUpVec+0x50:
78144e38 8b448efc         mov   eax,[esi+ecx*4-0x4] ds:0023:125bf942=????????
0:001> !analyze -v -f
*******************************************************************************
*                                                                             *
*                        Exception Analysis                                   *
*                                                                             *
*******************************************************************************


FAULTING_IP: 
MSVCR80!UnwindUpVec+50 [F:\RTM\vctools\crt_bld\SELF_X86\crt\src\intel\memcpy.asm @ 322]
78144e38 8b448efc         mov     eax,[esi+ecx*4-0x4]

EXCEPTION_RECORD:  ffffffff -- (.exr ffffffffffffffff)
ExceptionAddress: 78144e38 (MSVCR80!UnwindUpVec+0x00000050)
   ExceptionCode: c0000005 (Access violation)
  ExceptionFlags: 00000000
NumberParameters: 2
   Parameter[0]: 00000000
   Parameter[1]: 125bf942
Attempt to read from address 125bf942

FAULTING_THREAD:  000016ec

DEFAULT_BUCKET_ID:  APPLICATION_FAULT

PROCESS_NAME:  firefox.exe

ERROR_CODE: (NTSTATUS) 0xc0000005 - The instruction at "0x%08lx" referenced memory at "0x%08lx". 

The memory could not be "%s".

READ_ADDRESS:  125bf942 

BUGCHECK_STR:  ACCESS_VIOLATION

LAST_CONTROL_TRANSFER:  from 3000d5d5 to 78144e38

STACK_TEXT:  
019afdbc 3000d5d5 0537c67c 125bf942 00000004 MSVCR80!UnwindUpVec+0x50 

[F:\RTM\vctools\crt_bld\SELF_X86\crt\src\intel\memcpy.asm @ 322]
019afddc 3000e408 00000002 02985408 000001bb nspr4!PR_EnumerateHostEnt+0x95 [h:\mozilla\tree-

main\mozilla\nsprpub\pr\src\misc\prnetdb.c @ 1414]
019afdf0 0043031c 00000002 02985008 000001bb nspr4!PR_EnumerateAddrInfo+0x28 [h:\mozilla\tree-

main\mozilla\nsprpub\pr\src\misc\prnetdb.c @ 2154]
019afe10 00459ed4 04585438 000001bb 0537c678 firefox!nsDNSRecord::GetNextAddr+0x30 

[h:\mozilla\tree-main\mozilla\netwerk\dns\src\nsdnsservice2.cpp @ 116]
019afe34 0045a414 0154e0a0 0154e0c8 0537c620 firefox!nsSocketTransport::RecoverFromError+0x86 

[h:\mozilla\tree-main\mozilla\netwerk\base\src\nssockettransport2.cpp @ 1218]
019afe48 00431498 036bad18 00000000 0154e0a0 firefox!nsSocketTransport::OnSocketDetached+0x3c 

[h:\mozilla\tree-main\mozilla\netwerk\base\src\nssockettransport2.cpp @ 1534]
019afe60 004318c5 0154e0c8 00000001 0154e0a0 firefox!nsSocketTransportService::DetachSocket+0x30 

[h:\mozilla\tree-main\mozilla\netwerk\base\src\nssockettransportservice2.cpp @ 163]
019afe84 00431930 00000002 00000000 0154ebc8 firefox!

nsSocketTransportService::DoPollIteration+0x22f [h:\mozilla\tree-

main\mozilla\netwerk\base\src\nssockettransportservice2.cpp @ 641]
019afe98 0029e39d 0154e0a8 0154ebc8 00000001 firefox!

nsSocketTransportService::OnProcessNextEvent+0x40 [h:\mozilla\tree-

main\mozilla\netwerk\base\src\nssockettransportservice2.cpp @ 484]
019afec4 0027582f 0154ebc8 00000001 019afee0 xpcom_core!nsThread::ProcessNextEvent+0x53 

[h:\mozilla\tree-main\mozilla\xpcom\threads\nsthread.cpp @ 473]
019afed8 004315d9 0154ebc8 00000001 0154e730 xpcom_core!NS_ProcessNextEvent_P+0x27 

[h:\mozilla\tree-main\mozilla\firefox-build\xpcom\build\nsthreadutils.cpp @ 225]
019afefc 0029e3df 0154e0ac 002d133c 00000000 firefox!nsSocketTransportService::Run+0x86 

[h:\mozilla\tree-main\mozilla\netwerk\base\src\nssockettransportservice2.cpp @ 525]
019aff1c 0027582f 0154ebc8 00000001 019aff38 xpcom_core!nsThread::ProcessNextEvent+0x95 

[h:\mozilla\tree-main\mozilla\xpcom\threads\nsthread.cpp @ 483]
019aff30 0029e2e8 00000001 00000001 014139fc xpcom_core!NS_ProcessNextEvent_P+0x27 

[h:\mozilla\tree-main\mozilla\firefox-build\xpcom\build\nsthreadutils.cpp @ 225]
019aff50 30015e7b 0154ebc8 01354410 01354410 xpcom_core!nsThread::ThreadFunc+0x5c [h:\mozilla\tree

-main\mozilla\xpcom\threads\nsthread.cpp @ 251]
019aff64 7c910732 00000006 30017f2d 014d1e58 nspr4!_PR_NativeRunThread+0xdb [h:\mozilla\tree-

main\mozilla\nsprpub\pr\src\threads\combined\pruthr.c @ 458]
019affac 78132a36 00000006 7c80b683 01537028 ntdll!RtlpAllocateFromHeapLookaside+0x42
019affbc 7c910732 00000006 01537028 7ffdb000 MSVCR80!_threadstartex+0x66 

[f:\rtm\vctools\crt_bld\self_x86\crt\src\threadex.c @ 326]
019affec 00000000 781329d0 01537028 00000000 ntdll!RtlpAllocateFromHeapLookaside+0x42


FOLLOWUP_IP: 
MSVCR80!UnwindUpVec+50 [F:\RTM\vctools\crt_bld\SELF_X86\crt\src\intel\memcpy.asm @ 322]
78144e38 8b448efc         mov     eax,[esi+ecx*4-0x4]

FAULTING_SOURCE_CODE:  


SYMBOL_STACK_INDEX:  0

FOLLOWUP_NAME:  MachineOwner

SYMBOL_NAME:  MSVCR80!UnwindUpVec+50

MODULE_NAME:  MSVCR80

IMAGE_NAME:  MSVCR80.dll

DEBUG_FLR_IMAGE_TIMESTAMP:  4333a455

STACK_COMMAND:  ~1s ; kb

FAILURE_BUCKET_ID:  ACCESS_VIOLATION_MSVCR80!UnwindUpVec+50

BUCKET_ID:  ACCESS_VIOLATION_MSVCR80!UnwindUpVec+50

Followup: MachineOwner
---------
for reference comment 5 details thread 1 (16ec), the following is thread 2:

ntdll!KiFastSystemCallRet
ntdll!ZwWaitForMultipleObjects+0xc
kernel32!WaitForMultipleObjectsEx+0x12c
kernel32!WaitForMultipleObjects+0x18
firefox!nsNotifyAddrListener::Run(void)+0x66 [h:\mozilla\tree-main\mozilla\netwerk\system\win32\nsnotifyaddrlistener.cpp @ 320]
xpcom_core!nsThread::ProcessNextEvent(int mayWait = 1, int * result = 0x01abff38)+0x95 [h:\mozilla\tree-main\mozilla\xpcom\threads\nsthread.cpp @ 483]
xpcom_core!NS_ProcessNextEvent_P(class nsIThread * thread = 0x00000001, int mayWait = 1)+0x27 [h:\mozilla\tree-main\mozilla\firefox-build\xpcom\build\nsthreadutils.cpp @ 225]
xpcom_core!nsThread::ThreadFunc(void * arg = 0x015385d0)+0x5c [h:\mozilla\tree-main\mozilla\xpcom\threads\nsthread.cpp @ 251]
nspr4!_PR_NativeRunThread(void * arg = 0x78132a36)+0xdb [h:\mozilla\tree-main\mozilla\nsprpub\pr\src\threads\combined\pruthr.c @ 458]
nspr4!pr_root(void * arg = 0x78132a36)+0xd [h:\mozilla\tree-main\mozilla\nsprpub\pr\src\md\windows\w95thred.c @ 120]
MSVCR80!_callthreadstartex(void)+0x1b [f:\rtm\vctools\crt_bld\self_x86\crt\src\threadex.c @ 348]
MSVCR80!_threadstartex(void * ptd = 0x00000000)+0x66 [f:\rtm\vctools\crt_bld\self_x86\crt\src\threadex.c @ 326]
kernel32!BaseThreadStart+0x37


and this is thread 1 again but with arguments:

MSVCR80!UnwindUpVec+0x50 [F:\RTM\vctools\crt_bld\SELF_X86\crt\src\intel\memcpy.asm @ 322]
nspr4!PR_EnumerateHostEnt(int enumIndex = 43536408, struct PRHostEnt * hostEnt = 0x00040002, unsigned short port = 0x501c, union PRNetAddr * address = 0x00000000)+0x95 [h:\mozilla\tree-main\mozilla\nsprpub\pr\src\misc\prnetdb.c @ 1414]
nspr4!PR_EnumerateAddrInfo(void * iterPtr = 0x00459ed4, struct PRAddrInfo * base = 0x04585438, unsigned short port = 0x1bb, union PRNetAddr * result = 0x0537c678)+0x28 [h:\mozilla\tree-main\mozilla\nsprpub\pr\src\misc\prnetdb.c @ 2154]
firefox!nsDNSRecord::GetNextAddr(unsigned short port = 0x1bb, union PRNetAddr * addr = 0x0537c678)+0x30 [h:\mozilla\tree-main\mozilla\netwerk\dns\src\nsdnsservice2.cpp @ 116]
firefox!nsSocketTransport::RecoverFromError(void)+0x86 [h:\mozilla\tree-main\mozilla\netwerk\base\src\nssockettransport2.cpp @ 1218]
firefox!nsSocketTransport::OnSocketDetached(struct PRFileDesc * fd = 0x036bad18)+0x3c [h:\mozilla\tree-main\mozilla\netwerk\base\src\nssockettransport2.cpp @ 1534]
firefox!nsSocketTransportService::DetachSocket(struct nsSocketTransportService::SocketContext * sock = 0x0154e0c8)+0x30 [h:\mozilla\tree-main\mozilla\netwerk\base\src\nssockettransportservice2.cpp @ 163]
firefox!nsSocketTransportService::DoPollIteration(int wait = 2)+0x22f [h:\mozilla\tree-main\mozilla\netwerk\base\src\nssockettransportservice2.cpp @ 641]
firefox!nsSocketTransportService::OnProcessNextEvent(class nsIThreadInternal * thread = 0x0154ebc8, int mayWait = 1, unsigned int depth = 0)+0x40 [h:\mozilla\tree-main\mozilla\netwerk\base\src\nssockettransportservice2.cpp @ 484]
xpcom_core!nsThread::ProcessNextEvent(int mayWait = 1, int * result = 0x019afee0)+0x53 [h:\mozilla\tree-main\mozilla\xpcom\threads\nsthread.cpp @ 473]
xpcom_core!NS_ProcessNextEvent_P(class nsIThread * thread = 0x0154ebc8, int mayWait = 1)+0x27 [h:\mozilla\tree-main\mozilla\firefox-build\xpcom\build\nsthreadutils.cpp @ 225]
firefox!nsSocketTransportService::Run(void)+0x86 [h:\mozilla\tree-main\mozilla\netwerk\base\src\nssockettransportservice2.cpp @ 525]
xpcom_core!nsThread::ProcessNextEvent(int mayWait = 1, int * result = 0x019aff38)+0x95 [h:\mozilla\tree-main\mozilla\xpcom\threads\nsthread.cpp @ 483]
xpcom_core!NS_ProcessNextEvent_P(class nsIThread * thread = 0x00000001, int mayWait = 1)+0x27 [h:\mozilla\tree-main\mozilla\firefox-build\xpcom\build\nsthreadutils.cpp @ 225]
xpcom_core!nsThread::ThreadFunc(void * arg = 0x0154ebc8)+0x5c [h:\mozilla\tree-main\mozilla\xpcom\threads\nsthread.cpp @ 251]
nspr4!_PR_NativeRunThread(void * arg = 0x78132a36)+0xdb [h:\mozilla\tree-main\mozilla\nsprpub\pr\src\threads\combined\pruthr.c @ 458]
ntdll!RtlpAllocateFromHeapLookaside+0x42
MSVCR80!_threadstartex(void * ptd = 0x00000000)+0x66 [f:\rtm\vctools\crt_bld\self_x86\crt\src\threadex.c @ 326]
ntdll!RtlpAllocateFromHeapLookaside+0x42
timeless, you're right.  Firefox calls PR_EnumerateAddrInfo,
which calls PR_EnumerateHostEnt.

As far as I can tell, it seems that we crash in memcpy because
the second argument 125bf942 (the source address for memcpy) is
a bad address.  PR_EnumerateHostEnt has two memcpy calls.  We
are crashing in the second memcpy call at line 1414 for the
IPv4 case:

1412              PR_ASSERT(AF_INET == hostEnt->h_addrtype);
1413              address->inet.port = htons(port);
1414              memcpy(&address->inet.ip, addr, hostEnt->h_length);

As I said above, I believe 'addr' is a bad address.

I suggest that you or Darin artificially cause the following
code path to be executed in a debugger and see what happens:

nspr4!PR_EnumerateHostEnt

nspr4!PR_EnumerateAddrInfo

firefox!nsDNSRecord::GetNextAddr

firefox!nsSocketTransport::RecoverFromError

firefox!nsSocketTransport::OnSocketDetached

firefox!nsSocketTransportService::DetachSocket

firefox!nsSocketTransportService::DoPollIteration

firefox!nsSocketTransportService::OnProcessNextEvent
timeless, your thread 1 stack with arguments in comment 6 seems
strange in these three stack frames at the bottom:

nspr4!PR_EnumerateHostEnt(int enumIndex = 43536408, struct PRHostEnt * hostEnt
= 0x00040002, unsigned short port = 0x501c, union PRNetAddr * address =
0x00000000)+0x95 [h:\mozilla\tree-main\mozilla\nsprpub\pr\src\misc\prnetdb.c @
1414]
nspr4!PR_EnumerateAddrInfo(void * iterPtr = 0x00459ed4, struct PRAddrInfo *
base = 0x04585438, unsigned short port = 0x1bb, union PRNetAddr * result =
0x0537c678)+0x28 [h:\mozilla\tree-main\mozilla\nsprpub\pr\src\misc\prnetdb.c @
2154]
firefox!nsDNSRecord::GetNextAddr(unsigned short port = 0x1bb, union PRNetAddr *
addr = 0x0537c678)+0x30
[h:\mozilla\tree-main\mozilla\netwerk\dns\src\nsdnsservice2.cpp @ 116]

The 'unsigned short port' argument and the 'union PRNetAddr *' argument
change values when they reach the PR_EnumerateHostEnt function.  This doesn't
match the PR_EnumerateAddrInfo source code, which passes those two values
unchanged to PR_EnumerateHostEnt.  If the 'union PRNetAddr * address'
argument to PR_EnumerateHostEnt really is 0x00000000, we would have
crashed earlier at the memset call at line 1398:

1398      memset(address, 0, sizeof(PRNetAddr));
I'm aware of that. I'm at home, I'll be at work shortly. I had the disassembly for that frame in notepad, I'll attach it if I remember.

please keep in mind that windbg shows the value of the variable which is aliased as the argument at that frame, if some optimization results in the argument's slot being used for some other purpose, then *poof*, you'll see something strange. you should always assume that the correct argument was passed from the calling frame and that the variable has since been modified and what's being shown is either the local value of the argument or some other variable that shared the register.
I'm seeing this too, in our 3rd party product, using MOZILLA_1_8_BRANCH:

Crashing instruction:

  memcpy(&address->inet.ip, addr, hostEnt->h_length);

Crashing thread 14
00 nspr4!PR_EnumerateHostEnt(int enumIndex = <Memory access error>, struct PRHostEnt * hostEnt = <Memory access error>, unsigned short port = <Memory access error>, union PRNetAddr * address = <Memory access error>)+0x80
01 nspr4!PR_EnumerateAddrInfo(void * iterPtr = 0x00c47a3f, struct PRAddrInfo * base = 0x0581f2d8, unsigned short port = 0x50, union PRNetAddr * result = 0x057606a8)+0x28
02 necko!nsDNSRecord::GetNextAddr(unsigned short port = 0x50, union PRNetAddr * addr = 0x057606a8)+0x29
03 necko!nsSocketTransport::RecoverFromError(void)+0x8a
04 necko!nsSocketTransport::OnSocketDetached(struct PRFileDesc * fd = 0x05824e70)+0x3d
05 necko!nsSocketTransportService::DetachSocket(struct nsSocketTransportService::SocketContext * sock = 0x008ebdc6)+0x31
06 necko!nsSocketTransportService::Run(void)+0x238
07 xpcom_core!nsThread::Main(void * arg = 0x7c80b683)+0x1e
08 nspr4!_PR_NativeRunThread(void * arg = 0x7c80b683)+0xd1
09 js3250!JS_DHashTableOperate(struct JSDHashTable * table = 0x00000004, void * key = 0x05760650, JSDHashOperator op = 11569720 (No matching enumerant))+0xe1
0a kernel32!BaseThreadStart+0x37
Is this a dup of bug 290190?
QA Contact: wtchang → nspr
I got this crash also on: Mozilla/5.0 (Macintosh; U; Intel Mac OS X; es-ES; rv:1.9a9pre) Gecko/2007101904 Minefield/3.0a9pre

Just surfing around on gmail, and my home network went down.  The crash happened afterwards. 

Crash reporter: http://crash-stats.mozilla.com/report/index/8eaf80c0-7f28-11dc-97b2-001a4bd46e84?date=2007-10-20-16

Duplicate of this bug: 401500
(In reply to comment #13)
> *** Bug 401500 has been marked as a duplicate of this bug. ***
> 

Bug 401500 was a Linux crash, not WinXP. Comment #12 is for an Intel Mac. Please set this bug's OS to ALL.
Mozilla/5.0 (Macintosh; U; Intel Mac OS X; es-ES; rv:1.9a9pre) Gecko/2007102904 Minefield/3.0a9pre

I repro'd this again on intel mac today.  This is becoming a common crash for me, as my mbp has a flaky wireless card to begin with.  My common steps is my Airport wifi will drop connection, and in the middle of surfing on the browser, it will crash and generate this stack: 

http://crash-stats.mozilla.com/report/index/9cc4d147-8657-11dc-bb04-001a4bd43e5c?date=2007-10-29-19

This has quietly made it up to #13 on Top Crashes.  Recommending blocking1.9 flag for M9.

Flags: blocking1.9?
OS: Windows XP → All
Hardware: PC → All
This bug should be investigated as a Core:Networking bug.

Tony (Chung), if you can reproduce this crash easily, could
you cause a debug build to crash in the debugger and find out
which of the three pointer dereferences on this line is the
culprit?

prnetdb.c:2098
      result->raw.family = ai->ai_addr->sa_family;

In other words, please find out whether 'result',
'ai', or 'ai->ai_addr' is a bad pointer when Firefox crashes.

If you don't have the setup to do that, I will check in a
patch to do one pointer dereferencing per line and ask you
to crash Firefox and submit a Talkback report.
I dont have a debug environment running here.  Since trunk is using breakpad, i wont be able to run talkback since i'm testing on trunk nightlies.  Are these stack traces not enough information?   I can try arranging to use the apple crash reporter as well if you need more data.

Otherwise, i'd be happy to continue reporting these crashes if it still repros with your patch.
Tony,

I meant breakpad crash reports.  Sorry.

I looked at the source code for an hour.  The only problem I found
is that in a nsDNSRecord, mIter and mHostRecord->addr_info will get
out of sync if nsHostResolver::OnLookupComplete is called on
the nsHostRecord that mHostRecord points to:

nsHostResolver.cpp:

602 void
603 nsHostResolver::OnLookupComplete(nsHostRecord *rec, nsresult status, PRAddrInfo *result)
604 {
            ...
615         // update record fields.  We might have a rec->addr_info already if a
616         // previous lookup result expired and we're reresolving it..
617         if (rec->addr_info)
618           PR_FreeAddrInfo(rec->addr_info);
619         rec->addr_info = result;

When this happens, we need to invalidate the mIter fields of all nsDNSRecord
objects whose mHostRecord points to rec.  But I haven't figured out how to
do that.  Also, after invalidating mIter, I am not sure if rewinding mIter
(resetting it to nsnull) is the right action because it'll likely cause
us to retry an address that we used before.

I have a hunch that I'm on the right track, and hope that someone more
familiar with the networking code can figure out what's going on from this
description.
(In reply to comment #18)
[...]
> I looked at the source code for an hour.  The only problem I found
> is that in a nsDNSRecord, mIter and mHostRecord->addr_info will get
> out of sync if nsHostResolver::OnLookupComplete is called on
> the nsHostRecord that mHostRecord points to:
[...]

Huh-uh. Two days ago I had a similar crash (see bug 401500, duped to this one), for the first time in my life, on a desktop computer with a permanent "copper" connection and in a city with a good power supply, but at a time when up to 5 tabs, including 3 for tinderbox.mozilla.org, might have decided to auto-refresh. A race condition such as what you're describing sounds like a plausible explanation (to me, who don't know Mozilla code).
Assignee: wtc → nobody
Component: NSPR → Networking
Product: NSPR → Core
QA Contact: nspr → networking
Version: other → Trunk
I can't figure out a way to cause Firefox to crash there.  If my theory is correct,
the crash is caused by reading from memory that has been freed.  But freed memory
reads don't always result in a crash.  :-(
Got another crash today.   Same thing, was surfing around and minding my own business, and my wifi on my router dropped connection.  Crash happens shortly after Firefox tried to connect to the server.

It doesnt happen everytime my wifi drops, but its the third time i've already seen this crash report.

Running Mozilla/5.0 (Macintosh; U; Intel Mac OS X 10.4; es-ES; rv:1.9a9pre) Gecko/2007110104 Minefield/3.0a9pre.  

Breakpad Report: http://crash-stats.mozilla.com/report/index/3449ede1-8893-11dc-af4e-001a4bd46e84?date=2007-11-01-15


Duplicate of this bug: 290190
The crash reports in this bug and the crash stats provided by
Tony Mechelynck in bug 401500 comment 1 are strong support for
my theory that it is mIter that is bad.  All the crashes occured
on IPv6-enabled systems (which is why most of them are Mac OS X
and Linux) where mIter is a pointer to a node in a linked list.
If the linked list changes, the pointer becomes invalid.  On
other systems, mIter is an integer index into an array.  If
the array changes but the index is still within the bounds of
the new array (which is likely), the index is still "valid",
hence no crash.

I will try to spend some time this weekend putting together
a defensive patch.
This patch makes PR_EnumerateHostEnt and PR_EnumerateAddrInfo
validate their enumerator/iterator argument.  We should use this
patch only as a last resort because it punishes all callers of
these functions.  On the other hand, the overhead of the index/pointer
validation should be low for most hosts because most hosts have only
a small number of addresses.
Tony Chung, and anyone who encounters this crash on Mac OS X Intel:

Please replace the libnspr4.dylib in your Firefox 3 trunk build with
this libnspr4.dylib.  Let me know if this makes the crash go away.
If it does, this will be another strong support for my theory.
Hi Wan-teh, Thanks for taking the time to investigate and work on this.   i will try out .dylib file locally against my nightly builds, and monitor it for a few days to see if it still crashes.   Will let you know if i see crashes.
(In reply to comment #24)
> Created an attachment (id=287197)
> Last resort: PR_EnumerateAddrInfo validates its iterPtr argument
> 
> This patch makes PR_EnumerateHostEnt and PR_EnumerateAddrInfo
> validate their enumerator/iterator argument.  We should use this
> patch only as a last resort because it punishes all callers of
> these functions.  On the other hand, the overhead of the index/pointer
> validation should be low for most hosts because most hosts have only
> a small number of addresses.
> 

Hm, I see. Not an elegant solution, as you said; but I guess that, unless you can come up with something better, a small slowdown is better than a big crash. ;-)
-- a couple of "style nits":
   - shouldn't the condition in the "for" statement be *ap != NULL rather than != 0 ? (this, I think, doesn't change the logic, only makes it clearer)
   - more important: if na is the _number_ of entries, isn't there an off-by-one error in the validation of enumIndex? (< 0 or >= na)
Comment on attachment 287197 [details] [diff] [review]
Last resort: PR_EnumerateAddrInfo validates its iterPtr argument

Tony M.: thanks for the review comments.

Re: style: this for loop was copied from elsewhere in the same file.
I'm following the coding style of the original author.  It's more
important to use a consistent style than my favorite style.

Re: off-by-one error: it is better to think of na as the index of
the last array element than the number of elements.  Then it's
clear that a valid enumIndex should satisfy 0 <= enumIndex <= na.

Why is na the index of the last array element?  hostEnt->h_addr_list
is an array of (char *) pointers, with the last array element being
NULL.  In the for loop, we move ap and na in locked step.  So when
*ap is NULL (the last array element), na is the index of that element.
Attached patch Proposed patch (not quite ready) (obsolete) — Splinter Review
[Note: I can build nspr4.dll for Windows or libnspr4.so for Linux if
you'd like to test my "last resort" NSPR patch.  Just send me email.]

This patch uses a generation count to detect that mHostRecord->addr_info
has changed since we last used mIter.  Given my limited knowledge of
NetLib, this is the best solution I can come up with.

I have three questions.

1. I backed out Darin's "bandaid" patch (attachment 199372 [details] [diff] [review]) from bug 290190.
Should I leave the NS_ENSURE_STATE(mHostRecord->addr) there and just remove
the comment "but see bug 290190"?

2. When we detect that mHostRecord->addr_info has changed, should we restart
the iteration or just fail?

3. Does nsHostRecord need to be protected with a lock?  I know it is used
by at least two threads (the nsHostResolver worker thread and the thread
that calls nsSocketTransportService::DoPollIteration), but I don't know
whether these two threads may access nsHostRecord simultaneously or
nsHostRecord is always handed off from one thread to another.
(In reply to comment #28)
> (From update of attachment 287197 [details] [diff] [review])
> Tony M.: thanks for the review comments.
> 
> Re: style: this for loop was copied from elsewhere in the same file.
> I'm following the coding style of the original author.  It's more
> important to use a consistent style than my favorite style.

Well... OK

> 
> Re: off-by-one error: it is better to think of na as the index of
> the last array element than the number of elements.  Then it's
> clear that a valid enumIndex should satisfy 0 <= enumIndex <= na.
> 
> Why is na the index of the last array element?  hostEnt->h_addr_list
> is an array of (char *) pointers, with the last array element being
> NULL.  In the for loop, we move ap and na in locked step.  So when
> *ap is NULL (the last array element), na is the index of that element.
> 

The terminating NULL is a valid array element, is that it? Array bound checking has always been one of my worst nightmares.

(In reply to comment #29)
[...]
> [Note: I can build nspr4.dll for Windows or libnspr4.so for Linux if
> you'd like to test my "last resort" NSPR patch.  Just send me email.]
[...]
Good to know. I won't do it yet, because I've only had the crash once, and it would mean overwriting /usr/local/seamonkey/libnspr4.so again after each nightly install, but if I start crashing this way repeatedly I'll write -- unless you upload it somewhere first. :-) :-) :-)
Yes, the terminating NULL is the last valid array element.
Whenever an invalid pointer or array index is detected, we
write a line to the log file /tmp/wtc-getnextaddr.log.
Tony C.: please use this libnspr4.dylib instead.  It writes a line
to the log file /tmp/wtc-getnextaddr.log whenever it detects a bad
pointer.

Please run the Firefox 3 trunk build with this libnspr4.dylib and
report back in a week or two whether the log file
/tmp/wtc-getnextaddr.log is created.
Attachment #287198 - Attachment is obsolete: true
Flags: blocking1.9? → blocking1.9+
wtc: fwiw my wXP laptop does not seem to have ipv6 enabled
wtc: just fyi i replaced did libnspr4.dylib file for a day so far, and no crash to reproduce yet.  Will keep trying.

Also, i can try running windows side by side too, so if you have the .dll handy, just attach it here and i'll do the same.

question, if i update to nightly builds, i'll need to manually replace the .dylib file each time afterwards, correct?

Tony C: yes, whenever you update to a new nightly build, you need to
manually replace the libnspr4.dylib file.

Please also check /tmp/wtc-getnextaddr.log regularly, and report back
as soon as the log file is created.  Thanks.

Here is the libnspr4.so for Linux x86.  I will provide the Windows DLL
later today.
Priority: -- → P1
So this is currently identified as a P1 blocker, which means that it needs to be fixed by 1.9b2, in about a month.  wtc: it looks like you're making progress here, can you take this bug?
I'm working on this bug in my spare time (weekends).  I can't
reproduce the crash, which makes debugging very difficult.
All I have now is a guess of the cause of the crash.  Unless
Tony C. reports back that a /tmp/wtc-getnextaddr.log file is
created on his system by my libnspr4.dylib file, I am blocked.

I'd also need the real Mozilla developers to answer my questions
in comment 29.  Thanks!
Status: NEW → ASSIGNED
ive ran the nightlies this week on mac and the attachment wtc put in, but still no luck yet on a crash.  I'll be running it this week against the beta 1 candidate.  Will let you know once i get a repro.
Assignee: nobody → wtc
Status: ASSIGNED → NEW
wtc:  it seems like it would be better to deep copy the addrinfo from nspr somehow.  e.g., perform the complete enumeration and save the results.  then nsDNSRecord can just serve the copy of the results.   that would simplify all of the issues related to multithreading.
This is not a blocker, it's not a regression.  We'll take a patch if you have one.  
Flags: blocking1.9+ → blocking1.9-
Priority: P1 → P4
Attached patch Proposed patch (obsolete) — Splinter Review
Darin, thank you for the comment.  Saving a deep copy of the
addrinfo from NSPR in nsDNSRecord will consume more memory
(popular websites have multiple IP addresses) and defeat the
expiration of the addrinfo in nsHostRecord.  Each PRNetAddr
is ~ 108 bytes.

We can limit the memory overhead by copying only two or three
IP addresses, which should be sufficient in practice, but I'm
worried about not doing expiration.  On the other hand, serving
the copy of the results will avoid the awkward issue of
restarting the iteration when the addrinfo in nsHostRecord has
changed.  Do you think only saving 2 or 3 addresses and not
doing expiration is OK?

I decided to finish my patch by adding a lock to each
nsHostRecord.  I was worried about the memory overhead of
one PRLock per nsHostRecord, but it cannot be worse than
the deep copying approach.

Questions:
1. In nsDNSRecord::GetCanonicalName, does result.Assign(cname)
copy the |cname| string into |result|?  I browsed around in
LXR but couldn't find the implementation of nsACString::Assign().

2. What's the name of the thread that calls nsDNSRecord::GetCanonicalName
and nsDNSRecord::GetNextAddr?  In this comment I call it Thread X.

3. I studied the source code and found that nsHostRecord is
used by the nsHostResolver worker thread and Thread X, at
least.  Thread X only accesses the following fields:
- flags
- host
- addr_info
- addr

The |flags|, |host|, and |addr| fields do not change
after they have been assigned values.  This is why
|addr_info_lock| only needs to protect |addr_info|
and the |addr_info_gencnt| that I added.

In addition, Thread X only reads nsHostRecord.  Only
the nsHostResolver worker thread modifies nsHostRecord
(and only in nsHostResolver::OnLookupComplete).
This is relevant because the worker thread doesn't need
to lock when reading the addrinfo.

Please verify the above description of thread access to
nsHostRecord, because it determines the correctness of
my use of locking.  I will add a concise summary as a
comment to nsHostResolver.h.

4. I use PR_Lock/PR_Unlock so I can hold the lock for
the shortest time.  If you'd prefer that I use nsAutoLock,
I can do that.
Attachment #287199 - Attachment is obsolete: true
Attachment #288176 - Flags: superreview?(darin.moz)
Attachment #288176 - Flags: review?(cbiesinger)
Attachment #288176 - Attachment is patch: true
Attachment #288176 - Attachment mime type: application/octet-stream → text/plain
jst and I just spent a while coming up with basically the conclusions in this bug, with the current crash in fact being a regression from the patch in bug 397716.  Before that patch, we would just leak the old addr info in the situation described in comment 18.  Now we free a data structure that someone might still be referencing, leading to a crash.  Renominating based on that, since at this point that crash seems to be what this bug is about.  There's quite a number of breakpad reports for this.

My initial instinct for fixing this was to leave existing consumers who are holding long-term references to nsHostRecords with stale information and just create a new nsHostRecord if we have to resolve and we're not the only ones with a ref to the nsHostRecord.  If people feel that this is a better solution, I can attach the patch I have, but personally I like wtc's approach more.
Blocks: 397716
Flags: blocking1.9- → blocking1.9?
Keywords: regression
Boris, your approach seems similar to what Darin proposed in comment 41.
The advantage of your approaches is that we don't need to deal with the
issue of how to restart a PRAddrInfo enumeration with a new nsHostRecord.
Right.  And we'd only copy if the host record expired and someone was still holding on to it.

I guess it's a matter of what contract the DNS service exposes...
+ing.
Flags: blocking1.9? → blocking1.9+
wtc, just wanted to let u know i still run the .dylib file against my nightly builds.  i just have not found a repro crash since my last comment yet, so still investigating. 
(In reply to comment #43)
> 1. In nsDNSRecord::GetCanonicalName, does result.Assign(cname)
> copy the |cname| string into |result|?  I browsed around in
> LXR but couldn't find the implementation of nsACString::Assign().

The implementation is at http://lxr.mozilla.org/seamonkey/source/xpcom/string/src/nsTAString.cpp#227, usually leading to http://lxr.mozilla.org/seamonkey/source/xpcom/string/src/nsTSubstring.cpp#374

It just adds a reference to the buffer when possible; otherwise copies (strings are copy-on-write)

> 2. What's the name of the thread that calls nsDNSRecord::GetCanonicalName
> and nsDNSRecord::GetNextAddr?  In this comment I call it Thread X.

GetNextAddr is often called by the socket transport thread, but it can be called by PAC and other code on the main thread as well.

> Please verify the above description of thread access to
> nsHostRecord, because it determines the correctness of
> my use of locking.  I will add a concise summary as a
> comment to nsHostResolver.h.

The description is correct.

> 4. I use PR_Lock/PR_Unlock so I can hold the lock for
> the shortest time.  If you'd prefer that I use nsAutoLock,
> I can do that.

I prefer that. You can use something like:
  {
    nsAutoLock autolock(foo);
    // ...
   }
to limit the amount of time you hold the lock.


Ah, I see. I wrote that last comment before reading the patch. nsAutoLock wouldn't really work for the nsDNSRecord::GetNextAddr; feel free to keep the PR_Lock/PR_Unlock there.
Attachment #288176 - Flags: review?(cbiesinger) → review+
Comment on attachment 288176 [details] [diff] [review]
Proposed patch

Boris, any chance you could sr this? If we get this reviewed real soon, we could try to get this in for beta2.

Darin, if you're listening and you have the cycles, please don't let me stop you from sr:ing here :)
Attachment #288176 - Flags: superreview?(darin.moz) → superreview?(bzbarsky)
Comment on attachment 288176 [details] [diff] [review]
Proposed patch

>Index: netwerk/dns/src/nsDNSService2.cpp

I don't see anything preventing Rewind() getting called while we're in the middle of GetNextAddr() (from different threads).  Perhaps there should be?  File a followup bug, since that's been a problem all along?

Other than that, looks great.  sr=bzbarsky.
Attachment #288176 - Flags: superreview?(bzbarsky) → superreview+
Boris: good question.  I'm assuming that GetNextAddr() and Rewind() are
called from only one thread, and therefore it's not necessary to use a lock
to protect access to mIter and mDone.  Also, this LXR query seems to show
that Rewind() is defined but not used:

http://lxr.mozilla.org/seamonkey/ident?i=Rewind

I checked in this patch, after updating the comments in nsHostResolver.h.

Checking in nsDNSService2.cpp;
/cvsroot/mozilla/netwerk/dns/src/nsDNSService2.cpp,v  <--  nsDNSService2.cpp
new revision: 1.14; previous revision: 1.13
done
Checking in nsHostResolver.cpp;
/cvsroot/mozilla/netwerk/dns/src/nsHostResolver.cpp,v  <--  nsHostResolver.cpp
new revision: 1.21; previous revision: 1.20
done
Checking in nsHostResolver.h;
/cvsroot/mozilla/netwerk/dns/src/nsHostResolver.h,v  <--  nsHostResolver.h
new revision: 1.9; previous revision: 1.8
done
Attachment #288176 - Attachment is obsolete: true
Status: NEW → RESOLVED
Closed: 14 years ago
Resolution: --- → FIXED
Target Milestone: --- → mozilla1.9 M10
Target Milestone: mozilla1.9 M10 → mozilla1.9 M11
With my trunk build (which is already a bit old: Mozilla/5.0 (Macintosh; U; Intel Mac OS X 10.4; en-US; rv:1.9b4pre) Gecko/2008022621 Minefield/3.0b4pre) I sometimes get crashes with a stack like this one:

Thread 1 Crashed:
0   <<00000000>>   0xffff07fc __memcpy + 92 (cpu_capabilities.h:228)
1   libnecko.dylib 0x143e08aa nsDNSRecord::GetNextAddr(unsigned short, PRNetAddr*) + 460 (nsDNSService2.cpp:143)
2   libnecko.dylib 0x143cf4be nsSocketTransport::RecoverFromError() + 426 (nsSocketTransport2.cpp:1242)
3   libnecko.dylib 0x143cf61d nsSocketTransport::OnSocketDetached(PRFileDesc*) + 173 (nsSocketTransport2.cpp:1559)
...

I guess adding back NS_ENSURE_STATE(mHostRecord->addr) (proposed in comment 29) would fix it.

Not sure if it's better to reopen this bug, or bug 290190 (where this was added) or file a new one.
Please open a new bug and cite this bug and bug 290190.
Tell us in the new bug whether the URL has a host name
or literal IP address.
Depends on: 426060
Seems this problem is not really solved.  These crashes apparently persist.
See new Bug 501446 - Crash [@ PR_EnumerateAddrInfo ] at random times
Crash Signature: [@ PR_EnumerateHostEnt]
You need to log in before you can comment on or make changes to this bug.