Closed Bug 374829 Opened 18 years ago Closed 17 years ago

make trace-malloc use the XPCOM stack walking code

Categories

(Core :: General, defect)

defect
Not set
normal

Tracking

()

RESOLVED FIXED
mozilla1.9alpha8

People

(Reporter: dbaron, Assigned: dbaron)

References

Details

Attachments

(2 files, 6 obsolete files)

25.00 KB, patch
benjamin
: review+
bzbarsky
: approval1.9+
Details | Diff | Splinter Review
67.90 KB, patch
brendan
: review+
brendan
: review+
Details | Diff | Splinter Review
I want to make trace-malloc use the XPCOM stack walking code, so that we only need to make platform improvements in one place to get stack walking working in both places.

This depends on bug 374689, plus requires splitting the API I create there into two separate methods for getting the stack as a series of addresses and converting an address into a string (symbol, file/line info if present).
Note that when we do this we should remove the not-yet-hooked up Windows stack walking code in trace-malloc (of which there are two instances, the entire nsStackFrameWin.cpp file, and the ifdef-ed code in nsWinTraceMalloc).
Applies on top of attachment 259278 [details] [diff] [review].

Build and tested on Linux, partly built on Windows.  Will test more tomorrow.
Applies on top of attachment 259278 [details] [diff] [review] and attachment 260436 [details] [diff] [review].

Build and tested on Linux, partly built on Windows.  Will test more tomorrow.

Also contains fix for bug 363334.
I probably should add some comments documenting a few things about the big chunks  of code that I'm removing:
 (1) why we don't use the standard malloc-hooking mechanism on Windows
 (2) that we had some code to use libbfd that could be resurrected
And for what it's worth, I'm running into some issues on Windows:

 * the stack walking code's thread code fails at one point during startup (not the first time), though.  The SuspendThread call fails (returns -1) because "The handle is invalid" and the associated (I think, although the printfs come out in the wrong order) WaitForSingleObject call times out (fails).

 * around the same time (related?), the load of necko.dll fails, which causes some assertions in the pref service (the load is triggered by the NS_NewLocalFileInputStream call in openPrefFile).  Necko successfully loads later on, though.  It's failing because PR_LoadLibraryWithFlags returns null (in nsLocalFile::Load in nsLocalFileWin.cpp) -- haven't looked into it beyond that.

 * we crash late in shutdown with:

ntdll.dll!7c901230()
nspr4.dll!PR_Assert(const char * s=0x002ae4ec, const char * file=0x002ae448, int ln=216)  Line 546
nspr4.dll!PR_DestroyLock(PRLock * lock=0x00ffad70)  Line 216 + 0x1f
nspr4.dll!PR_DestroyMonitor(PRMonitor * mon=0x00ffad38)  Line 86 + 0xe
tracemalloc.dll!NS_TraceMallocShutdown()  Line 1473 + 0xa
tracemalloc.dll!_CRT_INIT(void * hDllHandle=0x00420000, unsigned long dwReason=0, void * lpreserved=0x00000001)  Line 234
tracemalloc.dll!_DllMainCRTStartup(void * hDllHandle=0x00420000, unsigned long dwReason=0, void * lpreserved=0x00000001)  Line 288 + 0x11
ntdll.dll!7c9011a7()
ntdll.dll!7c923f31()
ntdll.dll!7c910945()
ntdll.dll!7c91094e()
kernel32.dll!7c81cd76()
msvcr71d.dll!_heap_alloc_dbg(unsigned int nSize=2147351552, int nBlockUse=22, const char * szFileName=0x0012fef8, int nLine=44679)  Line 448 + 0x18
ntdll.dll!7c90f0aa()
kernel32.dll!7c81cdee()
msvcr71d.dll!__crtExitProcess(int status=0)  Line 464
msvcr71d.dll!doexit(int code=0, int quick=0, int retcaller=0)  Line 414 + 0x9C

msvcr71d.dll!exit(int code=0)  Line 303 + 0xd
firefox.exe!mainCRTStartup()  Line 406
kernel32.dll!7c816fd7()

Even without the patch I see crashes during shutdown with --shutdown-leaks (writing the allocation dump, which is earlier than the crash mentioned above) while trying to write out the contents of the memory for some of the leaked objects.  The allocation stack for the memory that it's trying to write out is:
CallocCallback
dhw_calloc
XPT_ArenaMalloc
xptiInterfaceEntry::newEntry
xptiManifest::Read
xptiInterfaceInfoManager::GetInterfaceInfoManagerNoAddRef
NS_InitXPCOM3_P
ScopedXPCOMStartup::Initialize
XRE_main
main
which I've seen 2 times in a row.  (But I think I was seeing a different stack, also reliably, a few days ago.)
(In reply to comment #5)
> And for what it's worth, I'm running into some issues on Windows:
> 
>  * the stack walking code's thread code fails at one point during startup (not
> the first time), though.  The SuspendThread call fails (returns -1) because
> "The handle is invalid" and the associated (I think, although the printfs come
> out in the wrong order) WaitForSingleObject call times out (fails).
> 
>  * around the same time (related?), the load of necko.dll fails, which causes
> some assertions in the pref service (the load is triggered by the
> NS_NewLocalFileInputStream call in openPrefFile).  Necko successfully loads
> later on, though.  It's failing because PR_LoadLibraryWithFlags returns null
> (in nsLocalFile::Load in nsLocalFileWin.cpp) -- haven't looked into it beyond
> that.

I managed to get a usable stack related to this problem, which shows that the SuspendThread call failure is while running static constructors for necko.dll:

>	xpcom_core.dll!NS_StackWalk(void (void *, void *)* aCallback=0x00422560, unsigned int aSkipFrames=1, void * aClosure=0x00000000)  Line 477	C++

 	tracemalloc.dll!backtrace(int skip=1)  Line 826 + 0x11	C

 	tracemalloc.dll!CallocCallback(void * ptr=0x011f3c80, unsigned int count=1, unsigned int size=12, unsigned int start=1368463156, unsigned int end=1368463156)  Line 1758 + 0x7	C

 	tracemalloc.dll!dhw_calloc(unsigned int count=1, unsigned int size=12)  Line 77 + 0x19	C++

 	nspr4.dll!PR_Calloc(unsigned int nelem=1, unsigned int elsize=12)  Line 518 + 0xe	C

 	nspr4.dll!PR_NewLogModule(const char * name=0x03d970f8)  Line 369 + 0x9	C

 	necko.dll!$E1()  Line 67 + 0xe	C++

 	msvcr71d.dll!_initterm(void (void)* * pfbegin=0x03df0104, void (void)* * pfend=0x03df03dc)  Line 600	C

 	necko.dll!_CRT_INIT(void * hDllHandle=0x03c80000, unsigned long dwReason=1, void * lpreserved=0x00000000)  Line 184 + 0xf	C

 	necko.dll!_DllMainCRTStartup(void * hDllHandle=0x03c80000, unsigned long dwReason=1, void * lpreserved=0x00000000)  Line 266 + 0x11	C

 	ntdll.dll!7c9011a7() 	

 	ntdll.dll!7c91cbab() 	

 	ntdll.dll!7c91c94d() 	

 	ntdll.dll!7c91d6d2() 	

 	ntdll.dll!7c91d9cb() 	

 	ntdll.dll!7c916178() 	


So this is basically the same problem as bug 363334, except it ends up with the two failures above instead of a deadlock because we're waiting to kill one of the deadlocking threads if it times out.

I think I want to fix this by:

 * converting suppress_tracing to thread-local storage, which will remove the invariant (which we don't even currently meet) that we always hold the trace-malloc lock while suppress_tracing is nonzero.  This should fix bug 376874 (see bug 376874 comment 1).

 * making backtrace do its own locking, and requiring that callers do not hold the trace-malloc lock while calling it
Though, really, all I probably need to do to fix it without fixing bug 376874 is to TM_EXIT_MONITOR and TM_ENTER_MONITOR around the call to NS_StackWalk.
except exiting the monitor around the NS_StackWalk call didn't fix the problems...
With the patches to bug 391141, bug 376874 (2 patches), and this bug (2 patches), applied on top of each other in that order, trace-malloc works for me on Windows (after only cursory testing).
Attachment #260654 - Attachment is obsolete: true
Er, I forgot to list bug 374689 at the *beginning* of that list.
Comment on attachment 275618 [details] [diff] [review]
patch to split XPCOM stack walking API so it's usable by trace-malloc

Note that this applies on top of the patch to bug 374689, which hasn't landed yet.
Attachment #275618 - Flags: review?(benjamin)
Note that I haven't yet fixed the deadlock in comment 5, but I now understand it, and have filed it as bug 391321.
Depends on: 391321
Attachment #275618 - Flags: review?(benjamin) → review+
Comment on attachment 275670 [details] [diff] [review]
patch to make trace-malloc use XPCOM stack walking API

Quick nits, I'll take a slower pass later today:

s/stack_entries/num_stack_entries/g or stack_depth

s/stack_entry/stack_index/g

/be
Comment on attachment 275618 [details] [diff] [review]
patch to split XPCOM stack walking API so it's usable by trace-malloc

This is mostly DEBUG-only, but not completely, so requesting approval.  It's part of the plan to consolidate stack-walking code and make trace-malloc usable on platforms other than Linux.
Attachment #275618 - Flags: approval1.9?
This:
 * addresses comment 17
 * changes a bare realloc call that happens within suppression to __libc_realloc (like other such calls)
Attachment #275670 - Attachment is obsolete: true
Attachment #275907 - Flags: review?(brendan)
Attachment #275670 - Flags: review?(brendan)
Comment on attachment 275618 [details] [diff] [review]
patch to split XPCOM stack walking API so it's usable by trace-malloc

a=bzbarsky
Attachment #275618 - Flags: approval1.9? → approval1.9+
Comment on attachment 275907 [details] [diff] [review]
patch to make trace-malloc use XPCOM stack walking API

A diff -w would have helped in a few spots, but r/a=me. Great to get rid of code!

/be
Attachment #275907 - Flags: review?(brendan)
Attachment #275907 - Flags: review+
Patches checked in to trunk, 2007-08-10 14:28 -0700 and 14:29 -0700.
Status: NEW → RESOLVED
Closed: 17 years ago
Resolution: --- → FIXED
Target Milestone: --- → mozilla1.9 M8
...and 2007-08-10 15:20 -0700.
You need to log in before you can comment on or make changes to this bug.

Attachment

General

Created:
Updated:
Size: