Closed Bug 1241684 Opened 8 years ago Closed 7 years ago

Reduce timeout in Windows stackwalk initialization code to speed up DMD start-up

Categories

(Core :: mozglue, defect)

defect
Not set
normal

Tracking

()

RESOLVED WONTFIX

People

(Reporter: n.nethercote, Assigned: n.nethercote)

References

Details

Attachments

(2 files, 1 obsolete file)

DMD is really slow to start up on Windows. Emanuel identified why more than a
year ago
(https://blog.mozilla.org/nnethercote/2014/12/11/cumulative-heap-profiling-in-firefox-with-dmd/comment-page-1/#comment-44068)
and I just got around to confirming it.

In EnsureWalkThreadReady(), we have this code:

> // The thread was created. Try to wait an arbitrary amount of time (1 second
> // should be enough) for its event loop to start before posting events to it.
> DWORD waitRet = ::WaitForSingleObject(readyEvent, 1000);
> if (waitRet == WAIT_TIMEOUT) {
>   // We get a timeout if we're called during static initialization because
>   // the thread will only start executing after we return so it couldn't
>   // have signalled the event. If that is the case, give up for now and
>   // try again next time we're called.
>   return false;
> }

I think this code was written with the assumption that stack-walking is a
relatively rare operation, but DMD does stack-walks a *lot*. AFAICT, this means
that the "timeout occurs during static initialization" case gets hit a lot, and
every time it does there's a 1 second penalty.

I tried reducing the timeout and it made a huge difference to how long it takes
for the first tab to render. Especially when I tell DMD to record every
allocation instead of using sampling. The following table has the numbers.

> timeout         time (sampling)         time (no sampling)
> -------         ---------------         ------------------
> 1000ms          50s                     ? (too slow)
>  100ms           9s                     180s
>   10ms           5s                      39s
>    0ms           4s                      18s

Running without sampling, which is something that is often useful, is basically
impossible with the current timeout length. So I'd like to reduce it as much as
possible.

I guess the possible consequence of this is that some calls to MozStackWalk()
might return false when they would otherwise return true with a longer timeout?
If that's a problem, maybe we could use 1000 for the first timeout and then
reduce it to 0 for any subsequent ones?

(Now that I think about it, I suspect EnsureWalkThreadReady() is not
threadsafe. It looks like multiple calls to it could cause multiple walker
threads to be created, but only one of them end up stored in |stackWalkThread|?)
Assignee: nobody → n.nethercote
Status: NEW → ASSIGNED
Attached patch Convert MozStackwalk to use APCs (obsolete) — Splinter Review
I am not trying to upstage you, Nick, but I've had a patch sitting around that does away with that wait in its entirety. Unlike with Windows UI messaging, where the message queue needs to be created explicitly (and thus we need to wait for that to happen), APC queues are implicitly created at the same time as the thread. This patch converts MozStackwalk to use APCs instead of UI messages.

Furthermore, this allows us to remove the dependency on user32.dll.
Attachment #8710835 - Flags: feedback?(ehsan)
I'm very happy to be upstaged... except your patch doesn't work for me :(  DMD hangs at startup for at least 90 seconds (I gave up after that long).

To debug it, follow the "build" and "run" instructions for DMD at https://developer.mozilla.org/en-US/docs/Mozilla/Performance/DMD#Desktop_Firefox_%28Windows%29.
Comment on attachment 8710745 [details] [diff] [review]
Reduce timeout in Windows stackwalk initialization code to speed up DMD start-up

Review of attachment 8710745 [details] [diff] [review]:
-----------------------------------------------------------------

::: mozglue/misc/StackWalk.cpp
@@ +302,5 @@
>    MOZ_ASSERT((stackWalkThread && readyEvent) ||
>               (!stackWalkThread && !readyEvent));
>  
> +  // The thread was created. Spin the event loop before posting events to it.
> +  DWORD waitRet = ::WaitForSingleObject(readyEvent, 0);

This is wrong.  Passing 0 as the second argument will change what this function does completely: it will only tell you the current state of the event object (whether it's signaled or not) and will return immediately, always.  So essentially you're destroying the wait mechanism!

If you want to pursue this, I prefer changing things that we only wait on this event if the first call to PostThreadMessage() below fails.  After all, that's the only time when we definitely need to wait.

The other thing that would be interesting to figure out is why it is that we're waiting for one second here in the first place.  I'm not sure if I understand that, but I would really like to...
Attachment #8710745 - Flags: feedback?(ehsan) → feedback-
Comment on attachment 8710835 [details] [diff] [review]
Convert MozStackwalk to use APCs

Review of attachment 8710835 [details] [diff] [review]:
-----------------------------------------------------------------

I like this approach a lot better.  But let's see why this doesn't work for Nick first.  f+ for now!
Attachment #8710835 - Flags: feedback?(ehsan) → feedback+
(In reply to Nicholas Nethercote [:njn] from comment #0)
> DMD is really slow to start up on Windows. Emanuel identified why more than a
> year ago
> (https://blog.mozilla.org/nnethercote/2014/12/11/cumulative-heap-profiling-
> in-firefox-with-dmd/comment-page-1/#comment-44068)
> and I just got around to confirming it.

I don't understand what's happening at all, FWIW.  :-)

> In EnsureWalkThreadReady(), we have this code:
> 
> > // The thread was created. Try to wait an arbitrary amount of time (1 second
> > // should be enough) for its event loop to start before posting events to it.
> > DWORD waitRet = ::WaitForSingleObject(readyEvent, 1000);
> > if (waitRet == WAIT_TIMEOUT) {
> >   // We get a timeout if we're called during static initialization because
> >   // the thread will only start executing after we return so it couldn't
> >   // have signalled the event. If that is the case, give up for now and
> >   // try again next time we're called.
> >   return false;
> > }
> 
> I think this code was written with the assumption that stack-walking is a
> relatively rare operation, but DMD does stack-walks a *lot*. AFAICT, this
> means
> that the "timeout occurs during static initialization" case gets hit a lot,
> and
> every time it does there's a 1 second penalty.

Which "timeout occurs during static initialization" case?

> (Now that I think about it, I suspect EnsureWalkThreadReady() is not
> threadsafe. It looks like multiple calls to it could cause multiple walker
> threads to be created, but only one of them end up stored in
> |stackWalkThread|?)

Any usage of NS_StackWalk from anything other than the main thread is unsafe.  Is that what DMD does?
Flags: needinfo?(n.nethercote)
> > > // The thread was created. Try to wait an arbitrary amount of time (1 second
> > > // should be enough) for its event loop to start before posting events to it.
> > > DWORD waitRet = ::WaitForSingleObject(readyEvent, 1000);
> > > if (waitRet == WAIT_TIMEOUT) {
> > >   // We get a timeout if we're called during static initialization because
> > >   // the thread will only start executing after we return so it couldn't
> > >   // have signalled the event. If that is the case, give up for now and
> > >   // try again next time we're called.
> > >   return false;
> > > }
> > 
> > I think this code was written with the assumption that stack-walking is a
> > relatively rare operation, but DMD does stack-walks a *lot*. AFAICT, this
> > means
> > that the "timeout occurs during static initialization" case gets hit a lot,
> > and
> > every time it does there's a 1 second penalty.
> 
> Which "timeout occurs during static initialization" case?

The |return false| case with the comment that says "We get a timeout...". DMD does a stack walk for some fraction (possible all) calls to heap allocation functions like malloc. I confirmed that this case is being hit, and the number of times it gets hit correlates perfectly with the delay I get at startup.

> Any usage of NS_StackWalk from anything other than the main thread is
> unsafe.  Is that what DMD does?

Yes, and with abandon! If you turn sampling off we call it for every call to malloc/calloc/realloc/etc. This doesn't seem to have caused problems other than the one this bug is about...
Flags: needinfo?(n.nethercote)
The sampled stacks in bug 1154263 (probably unrelated?) also showed NS_StackWalk being called from both the HangMonitor thread and the (profiler's?) SamplingThread - see attachment 8592207 [details]. I think people very much expect this API to be thread safe.
The comment at the very top of StackWalk.h says:

> /* API for getting a stack trace of the C/C++ stack on the current thread */

Also, MozStackWalk has a |aThread| argument, the comment for which says:

> The thread for which the stack is to be retrieved. Passing null causes us to walk the stack of the current thread. On Windows, this is a thread HANDLE. It is currently not supported on any other platform."

Finally, StackWalk.cpp has stuff like this:

> // Debug routines are not threadsafe, so grab the lock.
> EnterCriticalSection(&gDbgHelpCS);

The whole module certainly gives the appearance of trying to be threadsafe.
(In reply to Nicholas Nethercote [:njn] from comment #3)
> I'm very happy to be upstaged... except your patch doesn't work for me :( 
> DMD hangs at startup for at least 90 seconds (I gave up after that long).
> 
> To debug it, follow the "build" and "run" instructions for DMD at
> https://developer.mozilla.org/en-US/docs/Mozilla/Performance/
> DMD#Desktop_Firefox_%28Windows%29.

During the startup hang, I see this:

Main Thread:

ntdll!NtSignalAndWaitForSingleObject+0xc
KERNELBASE!SignalObjectAndWait+0x92
dmd!MozStackWalk+0x2be
dmd!mozilla::dmd::StackTrace::Get+0xef
dmd!mozilla::dmd::AllocCallback+0x95
dmd!replace_malloc+0x6e
mozglue!malloc_impl+0x32
mozglue!moz_xmalloc+0xd
xul!operator new[]+0xd
xul!stagefright::List<stagefright::AString>::prep+0x10
xul!stagefright::List<stagefright::AString>::List<stagefright::AString>+0x18
xul!stagefright::splat_type<stagefright::List<stagefright::AString> >+0x47
xul!stagefright::Vector<stagefright::List<stagefright::AString> >::do_splat+0x18
xul!stagefright::VectorImpl::_do_splat+0x20
xul!stagefright::VectorImpl::insertAt+0x50
xul!stagefright::VectorImpl::push+0x1e
xul!stagefright::Vector<stagefright::List<stagefright::AString> >::push+0x13
xul!stagefright::AAtomizer::AAtomizer+0x4b
xul!`dynamic initializer for 'stagefright::AAtomizer::gAtomizer''+0xd
MSVCR120!_initterm+0x29
xul!_CRT_INIT+0x1b0
xul!__DllMainCRTStartup+0x69
xul!_DllMainCRTStartup+0x1c
ntdll!LdrxCallInitRoutine+0x16
ntdll!LdrpCallInitRoutine+0x43
ntdll!LdrpInitializeNode+0x101
ntdll!LdrpInitializeGraphRecurse+0x71
ntdll!LdrpPrepareModuleForExecution+0x8b
ntdll!LdrpLoadDllInternal+0x121
ntdll!LdrpLoadDll+0x92
ntdll!LdrLoadDll+0xd9
mozglue!`anonymous namespace'::patched_LdrLoadDll+0x73e
KERNELBASE!LoadLibraryExW+0x138
firefox!GetLibHandle+0x14
firefox!ReadDependentCB+0x23
firefox!ReadDependentCB+0x43
firefox!XPCOMGlueLoad+0x273
firefox!XPCOMGlueStartup+0x30
firefox!InitXPCOMGlue+0x115
firefox!NS_internal_main+0x7f
firefox!wmain+0x12e
firefox!__tmainCRTStartup+0xfe
KERNEL32!BaseThreadInitThunk+0x24
ntdll!__RtlUserThreadStart+0x2f
ntdll!_RtlUserThreadStart+0x1b

Stack walker thread:

ntdll!NtWaitForSingleObject+0xc
ntdll!LdrpDrainWorkQueue+0xbd
ntdll!LdrpInitializeThread+0x117
ntdll!_LdrpInitialize+0x30f
ntdll!LdrpInitialize+0x25
ntdll!LdrInitializeThunk+0x10

The problem is a deadlock on the loader lock: the first memory allocation is taking place during invocation of static constructors. The loader lock is held while this happens. Then we call MozStackWalk which is waiting on the stack walker thread to initialize. The stack walker thread can't initialize because it can't grab the loader lock!

In the case of DMD, we cannot rely on lazy initialization for the stack walker thread. I've modified the patch so that we can *optionally* prime that thread in advance so that it is ready for such scenarios.
Patch r2 still hangs for me at start-up with DMD is enabled :(
(In reply to Nicholas Nethercote [:njn] from comment #12)
> Patch r2 still hangs for me at start-up with DMD is enabled :(

Booo! But it works on my machine! ;-)

Can you please post stack traces of the main thread and the stack walker thread?
Flags: needinfo?(n.nethercote)
Comment on attachment 8716054 [details] [diff] [review]
Convert MozStackwalk to use APCs (r2)

Review of attachment 8716054 [details] [diff] [review]:
-----------------------------------------------------------------

Clearing the review until this works for njn...  Let me know if there's value in reviewing before that time.
Attachment #8716054 - Flags: review?(ehsan)
Sorry for the delayed response. I'm a total n00b at Windows debugging but I've done my best.

Note that when I start Firefox with DMD a window pops up, but it stays completely blank and the cursor spins continually.

I ran it in the Visual Studio debugger and stopped execution at that point. The "Main Thread" stack is as follows.

> [External Code]	
> xul.dll!mozilla::widget::WinUtils::WaitForMessage(unsigned long aTimeoutMs) Line 725	C++
> xul.dll!nsAppShell::ProcessNextNativeEvent(bool mayWait) Line 380	C++
> xul.dll!nsBaseAppShell::DoProcessNextNativeEvent(bool mayWait) Line 139	C++
> xul.dll!nsBaseAppShell::OnProcessNextEvent(nsIThreadInternal * thr, bool mayWait) Line 289	C++
> xul.dll!nsThread::ProcessNextEvent(bool aMayWait, bool * aResult) Line 990	C++
> xul.dll!NS_ProcessNextEvent(nsIThread * aThread, bool aMayWait) Line 297	C++
> xul.dll!mozilla::ipc::MessagePump::Run(base::MessagePump::Delegate * aDelegate) Line 128	C++
> xul.dll!MessageLoop::RunHandler() Line 228	C++
> xul.dll!MessageLoop::Run() Line 202	C++
> xul.dll!nsBaseAppShell::Run() Line 158	C++
> xul.dll!nsAppShell::Run() Line 259	C++
> xul.dll!nsAppStartup::Run() Line 282	C++
> xul.dll!XREMain::XRE_mainRun() Line 4291	C++
> xul.dll!XREMain::XRE_main(int argc, char * * argv, const nsXREAppData * aAppData) Line 4380	C++
> xul.dll!XRE_main(int argc, char * * argv, const nsXREAppData * aAppData, unsigned int aFlags) Line 4482	C++
> firefox.exe!do_main(int argc, char * * argv, char * * envp, nsIFile * xreDirectory) Line 220	C++
> firefox.exe!NS_internal_main(int argc, char * * argv, char * * envp) Line 360	C++
> firefox.exe!wmain(int argc, wchar_t * * argv) Line 138	C++
> [External Code]	

The stack walker thread is called "msvcr120.dll" (I worked that out by luck). Its stack is as follows.

> [External Code]	
> dmd.dll!WalkStackThread(void * aData) Line 480	C++
> [External Code]	

Line 480 is the loop in WalkStackThread().

All that was with a 64-bit build.
Flags: needinfo?(n.nethercote)
That's a normal stack for the main thread -- it's just waiting for input.

Based on those stacks, I'd say that everything looks fine. Could you try again, but this time grab traces for both the WalkStackThread and whichever thread has called into MozStackWalk?
(In reply to Aaron Klotz [:aklotz] (please use needinfo) from comment #16)
> Could you try
> again, but this time grab traces for both the WalkStackThread and whichever
> thread has called into MozStackWalk?

I looked at every thread, and AFAICT none of them are in MozStackWalk when I interrupt the hang. The fact that the hang occurs only after a (blank) window has appeared indicates that execution has proceeded well past the first few allocations. So something else must be going wrong, but I don't know what it is :(
I have a partially-working browser. I can create new tabs and type into the address bar, but no websites will load. Visiting about:memory and taking a measurement works. This is weird.
Did you check what the content process is doing? It sounds like that's what's being handging.
(In reply to Mike Hommey [:glandium] from comment #19)
> Did you check what the content process is doing? It sounds like that's
> what's being handging.

I... really should have thought of that myself. Thank you for the suggestion.
This is interesting: if I disable e10s then everything runs fine. But with e10s
enabled the hang is in the content process.

When it hangs the main thread in the content process looks like this:

> [External Code]
> dmd.dll!MozStackWalk(void (unsigned int, void *, void *, void *) * aCallback, unsigned int aSkipFrames, unsigned int aMaxFrames, void * aClosure, unsigned __int64 aThread, void * aPlatformData) Line 576C++
> dmd.dll!mozilla::dmd::StackTrace::Get(mozilla::dmd::Thread * aT) Line 787C++
> dmd.dll!mozilla::dmd::AllocCallback(void * aPtr, unsigned __int64 aReqSize, mozilla::dmd::Thread * aT) Line 1176C++
> dmd.dll!replace_malloc(unsigned __int64 aSize) Line 1257C++
> mozglue.dll!moz_xmalloc(unsigned __int64 size) Line 84C++
> xul.dll!stagefright::AAtomizer::AAtomizer() Line 33C++
> xul.dll!`dynamic initializer for 'stagefright::AAtomizer::gAtomizer''() Line 24C++
> [External Code]
> plugin-container.exe!__delayLoadHelper2(const ImgDelayDescr * pidd, __int64 (void) * * ppfnIATEntry) Line 289C++
> [External Code]
> plugin-container.exe!content_process_main(int argc, char * * argv) Line 180C++
> plugin-container.exe!wmain(int argc, wchar_t * * argv) Line 138C++
> [External Code]

Line 576 is this one:

> if (walkerReturn != WAIT_OBJECT_0 && data.walkCallingThread) {

There's only one other thread in the content process. It's called
"msvcr120.dll" (which is the same name as the stack walking thread in the
parent process) and its stack just contains this:

> [External Code]

When I dig more deeply I just get a stack trace whose entries are all in ntdll.dll.
Do you have Microsoft symbols turned on in your debugger?  That should allow Visual Studio symbolicate the frames coming from Windows DLLs which may help us figure out what's going on.
(In reply to :Ehsan Akhgari from comment #22)
> Do you have Microsoft symbols turned on in your debugger?  That should allow
> Visual Studio symbolicate the frames coming from Windows DLLs which may help
> us figure out what's going on.

https://developer.mozilla.org/en-US/docs/Mozilla/Using_the_Mozilla_symbol_server

For reference
I found a different way to make DMD work on Win32 in bug 1382076 -- I changed it to use FramePointerStackWalk() instead of MozStackWalk().
Status: ASSIGNED → RESOLVED
Closed: 7 years ago
Resolution: --- → WONTFIX
You need to log in before you can comment on or make changes to this bug.

Attachment

General

Created:
Updated:
Size: