Closed
Bug 978585
Opened 10 years ago
Closed 2 years ago
Loader lock deadlock on startup when initializing NS_StackWalk
Categories
(Core :: Gecko Profiler, defect)
Tracking
()
RESOLVED
INACTIVE
People
(Reporter: ted, Assigned: vikstrous)
References
(Blocks 1 open bug)
Details
Attachments
(1 file, 2 obsolete files)
I hit a deadlock on startup, there were only two threads running: the main thread and the profiler thread. The main thread was inside a call to GetProcAddress, and the profiler thread was inside NS_StackWalk->EnsureWalkThreadReady, which was trying to call GetModuleHandle and waiting on the loader lock. The profiler thread had paused the main thread because it was profiling it, thus causing a deadlock. Here are the stacks: ntdll.dll!_ZwWaitForSingleObject@12() Unknown ntdll.dll!_RtlpWaitOnCriticalSection@8() Unknown ntdll.dll!_RtlEnterCriticalSection@4() Unknown ntdll.dll!_LdrGetDllHandleEx@20() Unknown ntdll.dll!_LdrGetDllHandle@16() Unknown KernelBase.dll!_GetModuleHandleForUnicodeString@4() Unknown KernelBase.dll!_BasepGetModuleHandleExW@16() Unknown KernelBase.dll!_GetModuleHandleW@4() Unknown msvcr100.dll!_initptd(_tiddata * ptd=0x005e7fe8, threadlocaleinfostruct * ptloci=0x005e7de8) Line 384 C msvcr100.dll!_beginthreadex(void * security=0x00000000, unsigned int stacksize=0x00000000, unsigned int (void *) * initialcode=0x56f1ea3d, void * argument=0x00000120, unsigned int createflag=0x00000000, unsigned int * thrdaddr=0x038fb460) Line 161 C xul.dll!EnsureWalkThreadReady() Line 274 C++ xul.dll!NS_StackWalk(void (void *, void *, void *) * aCallback=0x56c57fa3, unsigned int aSkipFrames=0x00000000, unsigned int aMaxFrames=0x000003e7, void * aClosure=0x038ff428, unsigned int aThread=0x00000024, void * aPlatformData=0x038ff4e8) Line 494 C++ xul.dll!TableTicker::doNativeBacktrace(ThreadProfile & aProfile={...}, TickSample * aSample=0x038ff4b0) Line 473 C++ xul.dll!TableTicker::InplaceTick(TickSample * sample=0x038ff4b0) Line 581 C++ > xul.dll!SamplerThread::SampleContext(Sampler * sampler=0x00c17280, ThreadProfile * thread_profile=0x00c172c0) Line 181 C++ xul.dll!SamplerThread::Run() Line 123 C++ xul.dll!`anonymous namespace'::ThreadFunc(void * closure=0x00c0e940) Line 27 C++ msvcr100.dll!_callthreadstartex() Line 314 C msvcr100.dll!_threadstartex(void * ptd=0x005e7fe8) Line 292 C kernel32.dll!@BaseThreadInitThunk@12() Unknown ntdll.dll!___RtlUserThreadStart@8() Unknown ntdll.dll!__RtlUserThreadStart@8() Unknown ntdll.dll!_RtlImageNtHeaderEx@20() Unknown ntdll.dll!_RtlpImageDirectoryEntryToDataEx@20() Unknown ntdll.dll!_RtlImageDirectoryEntryToData@16() Unknown ntdll.dll!_LdrGetProcedureAddressEx@20() Unknown ntdll.dll!_LdrGetProcedureAddress@16() Unknown KernelBase.dll!_GetProcAddress@8() Unknown > xul.dll!SetupErrorHandling(const char * progname=0x5665d0f9) Line 4416 C++ xul.dll!XREMain::XRE_mainInit(bool * aExitFlag=0x002ef0bf) Line 2875 C++ xul.dll!XREMain::XRE_main(int argc=0x00000001, char * * argv=0x00627bc8, const nsXREAppData * aAppData=0x002ef20c) Line 4043 C++ xul.dll!XRE_main(int argc=0x00000001, char * * argv=0x00627bc8, const nsXREAppData * aAppData=0x002ef20c, unsigned int aFlags=0x00000000) Line 4284 C++ firefox.exe!do_main(int argc=0x00000001, char * * argv=0x002eee00, nsIFile * xreDirectory=0x00c18100) Line 282 C++ firefox.exe!NS_internal_main(int argc=0x00000001, char * * argv=0x00627bc8) Line 643 C++ firefox.exe!wmain(int argc=0x00000000, wchar_t * * argv=0x005e1f10) Line 112 C++ firefox.exe!__tmainCRTStartup() Line 552 C kernel32.dll!@BaseThreadInitThunk@12() Unknown ntdll.dll!___RtlUserThreadStart@8() Unknown ntdll.dll!__RtlUserThreadStart@8() Unknown I guess the fix here would be to initialize NS_StackWalk when not actually profiling, so that the main thread is still running while it happens.
Reporter | ||
Comment 1•10 years ago
|
||
My suggested super-simple fix from the end of comment 0: Put a call to EnsureWalkThreadReady (not exported from nsStackWalk.cpp, but not hidden either): http://hg.mozilla.org/mozilla-central/annotate/199e65efd08b/xpcom/base/nsStackWalk.cpp#l259 in SamplerThread::Run: http://hg.mozilla.org/mozilla-central/annotate/199e65efd08b/tools/profiler/platform-win32.cc#l110
Assignee | ||
Comment 2•10 years ago
|
||
That sounds reasonable. I'll try that and push to try to test. (I can't reproduce it locally)
Assignee | ||
Comment 3•10 years ago
|
||
Attempt #1 failed: https://tbpl.mozilla.org/?tree=Try&rev=79ac64d548e0 (just documenting this here)
Assignee | ||
Comment 4•10 years ago
|
||
Looks like we have a new deadlock. Ted, do you have a way to reproduce this bug? Can you confirm that my patch fixes the specific deadlock reported here? This is the stack trace for the new deadlock: Main Thread ntdll.dll!_RtlpCreateUCREntry@24() Unknown ntdll.dll!_RtlpFindAndCommitPages@8() Unknown ntdll.dll!_RtlpExtendHeap@8() Unknown ntdll.dll!@RtlpAllocateHeap@24() Unknown ntdll.dll!_RtlAllocateHeap@12() Unknown ntdll.dll!_RtlpDosPathNameToRelativeNtPathName_Ustr@20() Unknown ntdll.dll!_RtlpDosPathNameToRelativeNtPathName_U@20() Unknown ntdll.dll!_RtlDosPathNameToRelativeNtPathName_U_WithStatus@16() Unknown KernelBase.dll!_GetFileAttributesExW@12() Unknown xul.dll!GetFileInfo(const nsString & name, PRFileInfo64 * info) Line 695 C++ xul.dll!nsLocalFile::ResolveAndStat() Line 1058 C++ xul.dll!nsLocalFile::GetTarget(nsAString_internal & _retval) Line 3073 C++ xul.dll!nsDirEnumerator::Init(nsIFile * parent) Line 860 C++ xul.dll!nsLocalFile::GetDirectoryEntries(nsISimpleEnumerator * * entries) Line 3118 C++ xul.dll!nsXREDirProvider::LoadAppBundleDirs() Line 624 C++ xul.dll!nsXREDirProvider::Initialize(nsIFile * aXULAppDir, nsIFile * aGREDir, nsIDirectoryServiceProvider * aAppProvider) Line 114 C++ xul.dll!XREMain::XRE_mainInit(bool * aExitFlag) Line 3002 C++ xul.dll!XREMain::XRE_main(int argc, char * * argv, const nsXREAppData * aAppData) Line 4055 C++ xul.dll!XRE_main(int argc, char * * argv, const nsXREAppData * aAppData, unsigned int aFlags) Line 4296 C++ firefox.exe!do_main(int argc, char * * argv, nsIFile * xreDirectory) Line 282 C++ firefox.exe!NS_internal_main(int argc, char * * argv) Line 643 C++ firefox.exe!wmain(int argc, wchar_t * * argv) Line 112 C++ firefox.exe!__tmainCRTStartup() Line 552 C kernel32.dll!@BaseThreadInitThunk@12() Unknown ntdll.dll!___RtlUserThreadStart@8() Unknown ntdll.dll!__RtlUserThreadStart@8() Unknown msvcr100.dll!_threadstartex ntdll.dll!_KiFastSystemCallRet@0() Unknown ntdll.dll!_ZwWaitForSingleObject@12() Unknown ntdll.dll!_RtlpWaitOnCriticalSection@8() Unknown ntdll.dll!_RtlEnterCriticalSection@4() Unknown ntdll.dll!_LdrpLoadDll@24() Unknown ntdll.dll!_LdrLoadDll@16() Unknown mozglue.dll!`anonymous namespace'::patched_LdrLoadDll(wchar_t * filePath, unsigned long * flags, _UNICODE_STRING * moduleFileName, void * * handle) Line 599 C++ KernelBase.dll!_LoadLibraryExW@12() Unknown KernelBase.dll!_LoadLibraryExA@12() Unknown kernel32.dll!_LoadLibraryA@4() Unknown xul.dll!__delayLoadHelper2(const ImgDelayDescr * pidd, int (void) * * ppfnIATEntry) Line 305 C++ xul.dll!__tailMerge_dbghelp_dll() Unknown xul.dll!NS_StackWalk(void (void *, void *, void *) * aCallback, unsigned int aSkipFrames, unsigned int aMaxFrames, void * aClosure, unsigned int aThread, void * aPlatformData) Line 559 C++ xul.dll!TableTicker::doNativeBacktrace(ThreadProfile & aProfile, TickSample * aSample) Line 470 C++ xul.dll!TableTicker::InplaceTick(TickSample * sample) Line 578 C++ xul.dll!SamplerThread::SampleContext(Sampler * sampler, ThreadProfile * thread_profile) Line 187 C++ xul.dll!SamplerThread::Run() Line 131 C++ xul.dll!ThreadEntry(void * arg) Line 232 C++ msvcr100.dll!_callthreadstartex() Line 314 C msvcr100.dll!_threadstartex(void * ptd) Line 292 C kernel32.dll!@BaseThreadInitThunk@12() Unknown ntdll.dll!___RtlUserThreadStart@8() Unknown ntdll.dll!__RtlUserThreadStart@8() Unknown msvcr100.dll!_threadstartex ntdll.dll!_KiFastSystemCallRet@0() Unknown user32.dll!_NtUserGetMessage@16() Unknown user32.dll!_GetMessageW@16() Unknown xul.dll!WalkStackThread(void * aData) Line 443 C++ msvcr100.dll!_callthreadstartex() Line 314 C msvcr100.dll!_threadstartex(void * ptd) Line 292 C kernel32.dll!@BaseThreadInitThunk@12() Unknown ntdll.dll!___RtlUserThreadStart@8() Unknown ntdll.dll!__RtlUserThreadStart@8() Unknown xul.dll thread ntdll.dll!_KiFastSystemCallRet@0() Unknown ntdll.dll!_ZwWaitForSingleObject@12() Unknown ntdll.dll!_RtlpWaitOnCriticalSection@8() Unknown ntdll.dll!_RtlEnterCriticalSection@4() Unknown ntdll.dll!@RtlpAllocateHeap@24() Unknown ntdll.dll!_RtlAllocateHeap@12() Unknown ntdll.dll!_RtlProcessFlsData@4() Unknown KernelBase.dll!_FlsSetValue@8() Unknown msvcr100.dll!__CRTDLL_INIT(void * dwReason, unsigned long lpreserved, void *) Line 377 C msvcr100.dll!_CRTDLL_INIT(void * hDllHandle, unsigned long dwReason, void * lpreserved) Line 217 C ntdll.dll!_LdrpCallInitRoutine@16() Unknown ntdll.dll!_LdrpInitializeThread@4() Unknown ntdll.dll!__LdrpInitialize@8() Unknown ntdll.dll!_LdrInitializeThunk@8() Unknown
Flags: needinfo?(ted)
Reporter | ||
Comment 5•10 years ago
|
||
I don't have access to my Windows machine right now, I'll see if I can reproduce when I'm home next week. I believe I just had the profiler installed and set to profile startup, and it reproduced fairly frequently in that configuration.
Reporter | ||
Comment 6•10 years ago
|
||
From your stack it looks like the issue is that we delayload DbgHelp.dll: http://mxr.mozilla.org/mozilla-central/source/toolkit/library/moz.build#72 so the first call to a function from dbghelp: http://mxr.mozilla.org/mozilla-central/source/xpcom/base/nsStackWalk.cpp#557 winds up actually loading the DLL. I think if you swap the call to "EnsureWalkThreadReady" for "EnsureSymInitialized": http://mxr.mozilla.org/mozilla-central/source/xpcom/base/nsStackWalk.cpp#724 you'd accomplish both things, since that calls SymInitialize as well.
Assignee | ||
Updated•10 years ago
|
Assignee: nobody → vstanchev
Assignee | ||
Comment 7•10 years ago
|
||
https://tbpl.mozilla.org/?tree=Try&rev=aa32b9d6fca7 Enabling profiling on Talos used to trigger this, so that's why I'm testing it with my custom talos and profiling enabled. Please disregard the few unrelated commits. They they don't touch Windows code.
Attachment #8399631 -
Flags: review?(ted)
Reporter | ||
Comment 8•10 years ago
|
||
Comment on attachment 8399631 [details] [diff] [review] Loader lock deadlock on startup when initializing NS_StackWalk Review of attachment 8399631 [details] [diff] [review]: ----------------------------------------------------------------- This looks sensible, although I'm not a peer for any of this code. ::: xpcom/base/nsStackWalk.cpp @@ +717,5 @@ > > return retval; > } > > +EXPORT_XPCOM_API(nsresult) I would just leave this as bool, honestly. @@ +879,5 @@ > > +EXPORT_XPCOM_API(nsresult) > +NS_EnsureSymInitialized() > +{ > + return NS_OK; nit: you've got tabs here
Attachment #8399631 -
Flags: review?(ted) → review+
Reporter | ||
Updated•10 years ago
|
Flags: needinfo?(ted)
Assignee | ||
Updated•10 years ago
|
Attachment #8399631 -
Flags: review?(dbaron)
Comment on attachment 8399631 [details] [diff] [review] Loader lock deadlock on startup when initializing NS_StackWalk Could you make the name of the function NS_EnsureStackWalkInitialized, which will make more sense to external callers? Also, please the leave the return value a boolean; there's no need to change it to nsresult, which just complicates things. Though this bit: > if (gInitialized) >- return gInitialized; >+ return NS_OK; can certainly be "return true" rather than "return gInitialized". And fix the indentation in your two new empty functions to use four spaces rather than a tab. r=dbaron with that
Attachment #8399631 -
Flags: review?(dbaron) → review+
Assignee | ||
Comment 10•10 years ago
|
||
Changed interface to bool, fixed tabs and changed some comments https://tbpl.mozilla.org/?tree=Try&rev=8c2075cc2744
Attachment #8399631 -
Attachment is obsolete: true
Assignee | ||
Comment 11•10 years ago
|
||
Looks like we need nsStackWalk.h to compile in C and C doesn't have a bool. Should I switch back to using nsresult?
Flags: needinfo?(dbaron)
Use PRBool.
Flags: needinfo?(dbaron)
Assignee | ||
Comment 13•10 years ago
|
||
https://tbpl.mozilla.org/?tree=Try&rev=66e7e880c0a8 This is the patch with PRBool. I ran into an issue while testing it. It looks like it breaks the test tools/profiler/tests/test_enterjit_osr.js I confirmed that commenting out NS_EnsureStackWalkInitialized() in platform-win32.cc makes the test pass. I have no idea why only this test breaks, but the results are consistent. I wasn't able to reproduce any issues outside the test. jld, I see that you added this test. Do you have any ideas about what might be happening here?
Attachment #8407065 -
Attachment is obsolete: true
Flags: needinfo?(jld)
Comment 14•10 years ago
|
||
This is the assertion that fails:
> const ms = 5;
> p.StartProfiler(100, ms, ["js"], 1);
> let profile = (function arbitrary_name(){
> // A frame for |arbitrary_name| has been pushed.
> let then = Date.now();
> do {
> let n = 10000;
> while (--n); // OSR happens here
> // Spin until we're sure we have a sample.
> } while (Date.now() - then < ms * 2.5);
> return p.getProfileData().threads[0].samples;
> })();
> do_check_neq(profile.length, 0);
What I was trying to do here is (1) enable the profiler, (2) push a pseudostack frame, (3) do on-stack replacement from the interpreter into Baseline, and (4) spin until the profiler has sampled at least once in the on-stack-replaced Baseline jitcode.
It seems that setting the profiler to a 5ms period and waiting until 12.5ms (2.5 periods) has elapsed isn't enough to ensure that a sample has occurred. Maybe that should loop until the profile has a certain number of samples, instead?
Flags: needinfo?(jld)
Status: NEW → RESOLVED
Closed: 2 years ago
Resolution: --- → INACTIVE
You need to log in
before you can comment on or make changes to this bug.
Description
•