Closed Bug 978585 Opened 10 years ago Closed 2 years ago

Loader lock deadlock on startup when initializing NS_StackWalk

Categories

(Core :: Gecko Profiler, defect)

x86
Windows 7
defect
Not set
normal

Tracking

()

RESOLVED INACTIVE

People

(Reporter: ted, Assigned: vikstrous)

References

(Blocks 1 open bug)

Details

Attachments

(1 file, 2 obsolete files)

I hit a deadlock on startup, there were only two threads running: the main thread and the profiler thread. The main thread was inside a call to GetProcAddress, and the profiler thread was inside NS_StackWalk->EnsureWalkThreadReady, which was trying to call GetModuleHandle and waiting on the loader lock. The profiler thread had paused the main thread because it was profiling it, thus causing a deadlock.

Here are the stacks:
 	ntdll.dll!_ZwWaitForSingleObject@12()	Unknown
 	ntdll.dll!_RtlpWaitOnCriticalSection@8()	Unknown
 	ntdll.dll!_RtlEnterCriticalSection@4()	Unknown
 	ntdll.dll!_LdrGetDllHandleEx@20()	Unknown
 	ntdll.dll!_LdrGetDllHandle@16()	Unknown
 	KernelBase.dll!_GetModuleHandleForUnicodeString@4()	Unknown
 	KernelBase.dll!_BasepGetModuleHandleExW@16()	Unknown
 	KernelBase.dll!_GetModuleHandleW@4()	Unknown
 	msvcr100.dll!_initptd(_tiddata * ptd=0x005e7fe8, threadlocaleinfostruct * ptloci=0x005e7de8) Line 384	C
 	msvcr100.dll!_beginthreadex(void * security=0x00000000, unsigned int stacksize=0x00000000, unsigned int (void *) * initialcode=0x56f1ea3d, void * argument=0x00000120, unsigned int createflag=0x00000000, unsigned int * thrdaddr=0x038fb460) Line 161	C
 	xul.dll!EnsureWalkThreadReady() Line 274	C++
 	xul.dll!NS_StackWalk(void (void *, void *, void *) * aCallback=0x56c57fa3, unsigned int aSkipFrames=0x00000000, unsigned int aMaxFrames=0x000003e7, void * aClosure=0x038ff428, unsigned int aThread=0x00000024, void * aPlatformData=0x038ff4e8) Line 494	C++
 	xul.dll!TableTicker::doNativeBacktrace(ThreadProfile & aProfile={...}, TickSample * aSample=0x038ff4b0) Line 473	C++
 	xul.dll!TableTicker::InplaceTick(TickSample * sample=0x038ff4b0) Line 581	C++
>	xul.dll!SamplerThread::SampleContext(Sampler * sampler=0x00c17280, ThreadProfile * thread_profile=0x00c172c0) Line 181	C++
 	xul.dll!SamplerThread::Run() Line 123	C++
 	xul.dll!`anonymous namespace'::ThreadFunc(void * closure=0x00c0e940) Line 27	C++
 	msvcr100.dll!_callthreadstartex() Line 314	C
 	msvcr100.dll!_threadstartex(void * ptd=0x005e7fe8) Line 292	C
 	kernel32.dll!@BaseThreadInitThunk@12()	Unknown
 	ntdll.dll!___RtlUserThreadStart@8()	Unknown
 	ntdll.dll!__RtlUserThreadStart@8()	Unknown


 	ntdll.dll!_RtlImageNtHeaderEx@20()	Unknown
 	ntdll.dll!_RtlpImageDirectoryEntryToDataEx@20()	Unknown
 	ntdll.dll!_RtlImageDirectoryEntryToData@16()	Unknown
 	ntdll.dll!_LdrGetProcedureAddressEx@20()	Unknown
 	ntdll.dll!_LdrGetProcedureAddress@16()	Unknown
 	KernelBase.dll!_GetProcAddress@8()	Unknown
>	xul.dll!SetupErrorHandling(const char * progname=0x5665d0f9) Line 4416	C++
 	xul.dll!XREMain::XRE_mainInit(bool * aExitFlag=0x002ef0bf) Line 2875	C++
 	xul.dll!XREMain::XRE_main(int argc=0x00000001, char * * argv=0x00627bc8, const nsXREAppData * aAppData=0x002ef20c) Line 4043	C++
 	xul.dll!XRE_main(int argc=0x00000001, char * * argv=0x00627bc8, const nsXREAppData * aAppData=0x002ef20c, unsigned int aFlags=0x00000000) Line 4284	C++
 	firefox.exe!do_main(int argc=0x00000001, char * * argv=0x002eee00, nsIFile * xreDirectory=0x00c18100) Line 282	C++
 	firefox.exe!NS_internal_main(int argc=0x00000001, char * * argv=0x00627bc8) Line 643	C++
 	firefox.exe!wmain(int argc=0x00000000, wchar_t * * argv=0x005e1f10) Line 112	C++
 	firefox.exe!__tmainCRTStartup() Line 552	C
 	kernel32.dll!@BaseThreadInitThunk@12()	Unknown
 	ntdll.dll!___RtlUserThreadStart@8()	Unknown
 	ntdll.dll!__RtlUserThreadStart@8()	Unknown

I guess the fix here would be to initialize NS_StackWalk when not actually profiling, so that the main thread is still running while it happens.
Blocks: 967635
My suggested super-simple fix from the end of comment 0:

Put a call to EnsureWalkThreadReady (not exported from nsStackWalk.cpp, but not hidden either):
http://hg.mozilla.org/mozilla-central/annotate/199e65efd08b/xpcom/base/nsStackWalk.cpp#l259

in SamplerThread::Run:
http://hg.mozilla.org/mozilla-central/annotate/199e65efd08b/tools/profiler/platform-win32.cc#l110
That sounds reasonable. I'll try that and push to try to test. (I can't reproduce it locally)
Attempt #1 failed: https://tbpl.mozilla.org/?tree=Try&rev=79ac64d548e0
(just documenting this here)
Looks like we have a new deadlock. Ted, do you have a way to reproduce this bug? Can you confirm that my patch fixes the specific deadlock reported here? This is the stack trace for the new deadlock:

Main Thread
	ntdll.dll!_RtlpCreateUCREntry@24()	Unknown
 	ntdll.dll!_RtlpFindAndCommitPages@8()	Unknown
 	ntdll.dll!_RtlpExtendHeap@8()	Unknown
 	ntdll.dll!@RtlpAllocateHeap@24()	Unknown
 	ntdll.dll!_RtlAllocateHeap@12()	Unknown
 	ntdll.dll!_RtlpDosPathNameToRelativeNtPathName_Ustr@20()	Unknown
 	ntdll.dll!_RtlpDosPathNameToRelativeNtPathName_U@20()	Unknown
 	ntdll.dll!_RtlDosPathNameToRelativeNtPathName_U_WithStatus@16()	Unknown
 	KernelBase.dll!_GetFileAttributesExW@12()	Unknown
 	xul.dll!GetFileInfo(const nsString & name, PRFileInfo64 * info) Line 695	C++
 	xul.dll!nsLocalFile::ResolveAndStat() Line 1058	C++
 	xul.dll!nsLocalFile::GetTarget(nsAString_internal & _retval) Line 3073	C++
 	xul.dll!nsDirEnumerator::Init(nsIFile * parent) Line 860	C++
 	xul.dll!nsLocalFile::GetDirectoryEntries(nsISimpleEnumerator * * entries) Line 3118	C++
 	xul.dll!nsXREDirProvider::LoadAppBundleDirs() Line 624	C++
 	xul.dll!nsXREDirProvider::Initialize(nsIFile * aXULAppDir, nsIFile * aGREDir, nsIDirectoryServiceProvider * aAppProvider) Line 114	C++
 	xul.dll!XREMain::XRE_mainInit(bool * aExitFlag) Line 3002	C++
 	xul.dll!XREMain::XRE_main(int argc, char * * argv, const nsXREAppData * aAppData) Line 4055	C++
 	xul.dll!XRE_main(int argc, char * * argv, const nsXREAppData * aAppData, unsigned int aFlags) Line 4296	C++
 	firefox.exe!do_main(int argc, char * * argv, nsIFile * xreDirectory) Line 282	C++
 	firefox.exe!NS_internal_main(int argc, char * * argv) Line 643	C++
 	firefox.exe!wmain(int argc, wchar_t * * argv) Line 112	C++
 	firefox.exe!__tmainCRTStartup() Line 552	C
 	kernel32.dll!@BaseThreadInitThunk@12()	Unknown
 	ntdll.dll!___RtlUserThreadStart@8()	Unknown
 	ntdll.dll!__RtlUserThreadStart@8()	Unknown

msvcr100.dll!_threadstartex
	ntdll.dll!_KiFastSystemCallRet@0()	Unknown
 	ntdll.dll!_ZwWaitForSingleObject@12()	Unknown
 	ntdll.dll!_RtlpWaitOnCriticalSection@8()	Unknown
 	ntdll.dll!_RtlEnterCriticalSection@4()	Unknown
 	ntdll.dll!_LdrpLoadDll@24()	Unknown
 	ntdll.dll!_LdrLoadDll@16()	Unknown
 	mozglue.dll!`anonymous namespace'::patched_LdrLoadDll(wchar_t * filePath, unsigned long * flags, _UNICODE_STRING * moduleFileName, void * * handle) Line 599	C++
 	KernelBase.dll!_LoadLibraryExW@12()	Unknown
 	KernelBase.dll!_LoadLibraryExA@12()	Unknown
 	kernel32.dll!_LoadLibraryA@4()	Unknown
 	xul.dll!__delayLoadHelper2(const ImgDelayDescr * pidd, int (void) * * ppfnIATEntry) Line 305	C++
 	xul.dll!__tailMerge_dbghelp_dll()	Unknown
 	xul.dll!NS_StackWalk(void (void *, void *, void *) * aCallback, unsigned int aSkipFrames, unsigned int aMaxFrames, void * aClosure, unsigned int aThread, void * aPlatformData) Line 559	C++
 	xul.dll!TableTicker::doNativeBacktrace(ThreadProfile & aProfile, TickSample * aSample) Line 470	C++
 	xul.dll!TableTicker::InplaceTick(TickSample * sample) Line 578	C++
 	xul.dll!SamplerThread::SampleContext(Sampler * sampler, ThreadProfile * thread_profile) Line 187	C++
 	xul.dll!SamplerThread::Run() Line 131	C++
 	xul.dll!ThreadEntry(void * arg) Line 232	C++
 	msvcr100.dll!_callthreadstartex() Line 314	C
 	msvcr100.dll!_threadstartex(void * ptd) Line 292	C
 	kernel32.dll!@BaseThreadInitThunk@12()	Unknown
 	ntdll.dll!___RtlUserThreadStart@8()	Unknown
 	ntdll.dll!__RtlUserThreadStart@8()	Unknown

msvcr100.dll!_threadstartex
	ntdll.dll!_KiFastSystemCallRet@0()	Unknown
 	user32.dll!_NtUserGetMessage@16()	Unknown
 	user32.dll!_GetMessageW@16()	Unknown
 	xul.dll!WalkStackThread(void * aData) Line 443	C++
 	msvcr100.dll!_callthreadstartex() Line 314	C
 	msvcr100.dll!_threadstartex(void * ptd) Line 292	C
 	kernel32.dll!@BaseThreadInitThunk@12()	Unknown
 	ntdll.dll!___RtlUserThreadStart@8()	Unknown
 	ntdll.dll!__RtlUserThreadStart@8()	Unknown

xul.dll thread
	ntdll.dll!_KiFastSystemCallRet@0()	Unknown
 	ntdll.dll!_ZwWaitForSingleObject@12()	Unknown
 	ntdll.dll!_RtlpWaitOnCriticalSection@8()	Unknown
 	ntdll.dll!_RtlEnterCriticalSection@4()	Unknown
 	ntdll.dll!@RtlpAllocateHeap@24()	Unknown
 	ntdll.dll!_RtlAllocateHeap@12()	Unknown
 	ntdll.dll!_RtlProcessFlsData@4()	Unknown
 	KernelBase.dll!_FlsSetValue@8()	Unknown
 	msvcr100.dll!__CRTDLL_INIT(void * dwReason, unsigned long lpreserved, void *) Line 377	C
 	msvcr100.dll!_CRTDLL_INIT(void * hDllHandle, unsigned long dwReason, void * lpreserved) Line 217	C
 	ntdll.dll!_LdrpCallInitRoutine@16()	Unknown
 	ntdll.dll!_LdrpInitializeThread@4()	Unknown
 	ntdll.dll!__LdrpInitialize@8()	Unknown
 	ntdll.dll!_LdrInitializeThunk@8()	Unknown
Flags: needinfo?(ted)
I don't have access to my Windows machine right now, I'll see if I can reproduce when I'm home next week. I believe I just had the profiler installed and set to profile startup, and it reproduced fairly frequently in that configuration.
From your stack it looks like the issue is that we delayload DbgHelp.dll:
http://mxr.mozilla.org/mozilla-central/source/toolkit/library/moz.build#72

so the first call to a function from dbghelp:
http://mxr.mozilla.org/mozilla-central/source/xpcom/base/nsStackWalk.cpp#557

winds up actually loading the DLL. I think if you swap the call to "EnsureWalkThreadReady" for "EnsureSymInitialized":
http://mxr.mozilla.org/mozilla-central/source/xpcom/base/nsStackWalk.cpp#724

you'd accomplish both things, since that calls SymInitialize as well.
Assignee: nobody → vstanchev
https://tbpl.mozilla.org/?tree=Try&rev=aa32b9d6fca7

Enabling profiling on Talos used to trigger this, so that's why I'm testing it with my custom talos and profiling enabled. Please disregard the few unrelated commits. They they don't touch Windows code.
Attachment #8399631 - Flags: review?(ted)
Comment on attachment 8399631 [details] [diff] [review]
Loader lock deadlock on startup when initializing NS_StackWalk

Review of attachment 8399631 [details] [diff] [review]:
-----------------------------------------------------------------

This looks sensible, although I'm not a peer for any of this code.

::: xpcom/base/nsStackWalk.cpp
@@ +717,5 @@
>  
>      return retval;
>  }
>  
> +EXPORT_XPCOM_API(nsresult)

I would just leave this as bool, honestly.

@@ +879,5 @@
>  
> +EXPORT_XPCOM_API(nsresult)
> +NS_EnsureSymInitialized()
> +{
> +	return NS_OK;

nit: you've got tabs here
Attachment #8399631 - Flags: review?(ted) → review+
Flags: needinfo?(ted)
Attachment #8399631 - Flags: review?(dbaron)
Comment on attachment 8399631 [details] [diff] [review]
Loader lock deadlock on startup when initializing NS_StackWalk

Could you make the name of the function NS_EnsureStackWalkInitialized,
which will make more sense to external callers?

Also, please the leave the return value a boolean; there's no need
to change it to nsresult, which just complicates things.  Though this
bit:
>     if (gInitialized)
>-        return gInitialized;
>+        return NS_OK;
can certainly be "return true" rather than "return gInitialized".

And fix the indentation in your two new empty functions to use
four spaces rather than a tab.

r=dbaron with that
Attachment #8399631 - Flags: review?(dbaron) → review+
Changed interface to bool, fixed tabs and changed some comments

https://tbpl.mozilla.org/?tree=Try&rev=8c2075cc2744
Attachment #8399631 - Attachment is obsolete: true
Looks like we need nsStackWalk.h to compile in C and C doesn't have a bool. Should I switch back to using nsresult?
Flags: needinfo?(dbaron)
Use PRBool.
Flags: needinfo?(dbaron)
https://tbpl.mozilla.org/?tree=Try&rev=66e7e880c0a8

This is the patch with PRBool.

I ran into an issue while testing it. It looks like it breaks the test tools/profiler/tests/test_enterjit_osr.js

I confirmed that commenting out NS_EnsureStackWalkInitialized() in platform-win32.cc makes the test pass. I have no idea why only this test breaks, but the results are consistent.

I wasn't able to reproduce any issues outside the test.

jld, I see that you added this test. Do you have any ideas about what might be happening here?
Attachment #8407065 - Attachment is obsolete: true
Flags: needinfo?(jld)
This is the assertion that fails:

>    const ms = 5;
>    p.StartProfiler(100, ms, ["js"], 1);
>    let profile = (function arbitrary_name(){
>        // A frame for |arbitrary_name| has been pushed.
>        let then = Date.now();
>        do {
>            let n = 10000;
>            while (--n); // OSR happens here
>            // Spin until we're sure we have a sample.
>        } while (Date.now() - then < ms * 2.5);
>        return p.getProfileData().threads[0].samples;
>    })();
>    do_check_neq(profile.length, 0);

What I was trying to do here is (1) enable the profiler, (2) push a pseudostack frame, (3) do on-stack replacement from the interpreter into Baseline, and (4) spin until the profiler has sampled at least once in the on-stack-replaced Baseline jitcode.

It seems that setting the profiler to a 5ms period and waiting until 12.5ms (2.5 periods) has elapsed isn't enough to ensure that a sample has occurred.  Maybe that should loop until the profile has a certain number of samples, instead?
Flags: needinfo?(jld)
Blocks: 1329181
Status: NEW → RESOLVED
Closed: 2 years ago
Resolution: --- → INACTIVE
You need to log in before you can comment on or make changes to this bug.

Attachment

General

Created:
Updated:
Size: