Closed Bug 1263595 Opened 8 years ago Closed 8 years ago

Hang at RtlAddFunctionTable()/RtlDeleteFunctionTable() in RegisterExecutableMemory() and DeallocateExecutableMemory()

Categories

(Core :: Gecko Profiler, defect)

x86_64
Windows 10
defect
Not set
normal

Tracking

()

RESOLVED FIXED
mozilla52
Tracking Status
firefox49 --- wontfix
firefox50 --- wontfix
firefox51 --- wontfix
firefox52 --- wontfix
firefox53 --- fixed

People

(Reporter: blassey, Assigned: away)

References

Details

(Keywords: hang, reproducible)

Attachments

(5 files)

For the last two weeks my browser has been hung every morning that I've returned to it. Thursday morning and this morning I've attached Visual Studio and found these two stacks

 	ntdll.dll!RtlAddFunctionTable()	Unknown
>	xul.dll!RegisterExecutableMemory(void * p, unsigned __int64 bytes, unsigned __int64 pageSize) Line 168	C++
 	xul.dll!js::jit::AllocateExecutableMemory(void * addr, unsigned __int64 bytes, unsigned int permissions, const char * pageSize, unsigned __int64) Line 196	C++
 	xul.dll!js::jit::ExecutableAllocator::systemAlloc(unsigned __int64 n) Line 229	C++
 	xul.dll!js::jit::ExecutableAllocator::createPool(unsigned __int64 n) Line 229	C++
 	xul.dll!js::jit::ExecutableAllocator::poolForSize(unsigned __int64 n) Line 165	C++
 	xul.dll!js::jit::ExecutableAllocator::alloc(unsigned __int64 n, js::jit::ExecutablePool * * poolp, js::jit::CodeKind type) Line 263	C++
 	xul.dll!js::jit::Linker::newCode<1>(JSContext * cx, js::jit::CodeKind kind, bool hasPatchableBackedges) Line 59	C++
 	xul.dll!js::jit::BaselineCompiler::compile() Line 132	C++
 	xul.dll!js::jit::BaselineCompile(JSContext * cx, JSScript * script, bool forceDebugInstrumentation) Line 291	C++
 	xul.dll!js::jit::CanEnterBaselineMethod(JSContext * cx, js::RunState & state) Line 392	C++
 	xul.dll!js::RunScript(JSContext * cx, js::RunState & state) Line 412	C++
 	xul.dll!js::Invoke(JSContext * cx, const JS::CallArgs & args, js::MaybeConstruct construct) Line 497	C++
 	xul.dll!js::jit::DoCallFallback(JSContext * cx, js::jit::BaselineFrame * frame, js::jit::ICCall_Fallback * stub_, unsigned int argc, JS::Value * vp, JS::MutableHandle<JS::Value> res) Line 6115	C++
 	00000034dbf1a867()	Unknown
 	000002127ff6a800()	Unknown

 	[External Code]	
>	xul.dll!js::jit::DeallocateExecutableMemory(void * addr, unsigned __int64 pageSize, unsigned __int64) Line 220	C++
 	xul.dll!js::jit::ExecutableAllocator::releasePoolPages(js::jit::ExecutablePool * pool) Line 286	C++
 	xul.dll!js::jit::ExecutableAllocator::poisonCode(JSRuntime * rt, mozilla::Vector<js::jit::JitPoisonRange,0,js::SystemAllocPolicy> & ranges) Line 407	C++
 	xul.dll!js::FreeOp::~FreeOp() Line 763	C++
 	xul.dll!js::gc::GCRuntime::sweepPhase(js::SliceBudget & sliceBudget) Line 5571	C++
 	xul.dll!js::gc::GCRuntime::incrementalCollectSlice(js::SliceBudget & budget, JS::gcreason::Reason reason) Line 6134	C++
 	xul.dll!js::gc::GCRuntime::gcCycle(bool nonincrementalByAPI, js::SliceBudget & budget, JS::gcreason::Reason reason) Line 6360	C++
 	xul.dll!js::gc::GCRuntime::collect(bool nonincrementalByAPI, js::SliceBudget budget, JS::gcreason::Reason reason) Line 6468	C++
 	xul.dll!js::gc::GCRuntime::gcSlice(JS::gcreason::Reason reason, __int64 millis) Line 6541	C++
 	xul.dll!js::gc::GCRuntime::gcIfRequested(JSContext * cx) Line 6760	C++
 	xul.dll!js::Invoke(JSContext * cx, const JS::CallArgs & args, js::MaybeConstruct construct) Line 497	C++
 	xul.dll!js::Invoke(JSContext * cx, const JS::Value & thisv, const JS::Value & fval, unsigned int argc, const JS::Value * argv, JS::MutableHandle<JS::Value> rval) Line 528	C++
 	xul.dll!js::jit::InvokeFunction(JSContext * cx, JS::Handle<JSObject *> obj, bool constructing, unsigned int argc, JS::Value * argv, JS::MutableHandle<JS::Value> rval) Line 104	C++
 	[External Code]	

For the last Dealocate one I got the following disassembly.

        addr = (uint8_t*)addr - pageSize;
00007FFC4253055F  sub         rbx,r8  
        UnregisterExecutableMemory(addr, bytes, pageSize);
00007FFC42530562  mov         rcx,rbx  
00007FFC42530565  call        qword ptr [__imp_RtlDeleteFunctionTable (07FFC44210700h)]  
    }
#endif

    VirtualFree(addr, 0, MEM_RELEASE);
00007FFC4253056B  xor         edx,edx  
00007FFC4253056D  mov         r8d,8000h  
00007FFC42530573  mov         rcx,rbx  
}
00007FFC42530576  add         rsp,20h  
00007FFC4253057A  pop         rbx  
    }
#endif

    VirtualFree(addr, 0, MEM_RELEASE);
00007FFC4253057B  jmp         qword ptr [__imp_VirtualFree (07FFC44210A78h)]  
--- No source file -------------------------------------------------------------
00007FFC42530582  int         3  
00007FFC42530583  int         3  
--- c:\builds\moz2_slave\m-cen-w64-ntly-000000000000000\build\src\image\imagefactory.cpp 

/* static */ already_AddRefed<Image>
ImageFactory::CreateVectorImage(nsIRequest* aRequest,
                                ProgressTracker* aProgressTracker,
                                const nsCString& aMimeType,
                                ImageURL* aURI,
                                uint32_t aImageFlags,
                                uint32_t aInnerWindowId)
{
00007FFC42530584  mov         rax,rsp  
00007FFC42530587  mov         qword ptr [rax+8],rbx  
00007FFC4253058B  mov         qword ptr [rax+10h],rbp  
00007FFC4253058F  mov         qword ptr [rax+18h],rsi  
00007FFC42530593  mov         qword ptr [rax+20h],rdi  
00007FFC42530597  push        r14  
00007FFC42530599  sub         rsp,20h  
00007FFC4253059D  mov         rdi,rcx  
00007FFC425305A0  mov         rbp,r8
Luke, this seems to hang either in RtlAddFunctionTable or RtlDeleteFunctionTable. Any idea what's going on?
Flags: needinfo?(luke)
I should note that this is on Windows 10 64bit and the 64bit Nightly build.
OS: Unspecified → Windows 10
Hardware: Unspecified → x86_64
Just hit this again with less than a minute of uptime, so this doesn't look like sitting overnight is required (so probably not some resource exhaustion or churn problem).
Nothing sticks out. My guess would be some double free going on that puts the underlying Windows impl of those functions in an iloop.  It'd be great if we could get something reproducible to bisect.
Flags: needinfo?(luke)
This bug makes Nightly unusable for me. Is there something we can do to move forward here?
Flags: needinfo?(nihsanullah)
I'm also experiencing this intermittently and I think it might be triggered somewhat more frequently when running with the gecko profiler addon profiling. But even then it still takes hours, so take with a grain of salt.
Ah, now that I think of it, I may have very longwinded STR that can induce a hang promptly. But I don't recall if the hang was actually in the same signature and it's already a month old, so idk if it reproduces on current nightlies.

see bug 1160228 c28 for the steps involved.

It's probably possible to trim it down since it was originally designed to demonstrate a different issue.
Naveed, can you have someone test the above STRs?
Looking at it in blassey's debugger, one or the other of Rtl{Add,Delete}FunctionTable has some SRWLock-ful frames underneath it.  It looks to me like there's some sort of deadlock happening under the hood here, although by my limited understanding of that structure, I don't see how you'd get one with just the single lock in play.  But I confess I know ~zero about these Windows things, so I'm probably missing something.

FWIW regarding comment 2, all this code is win64-specific, because the exception-handling mechanism for 32-bit Windows is entirely different from that for 64-bit Windows.
On windows 64, there's a process-wide list of tables used for doing stack walks and exception unwinds. The profiler, and other stack walking mechanisms consult this table via RtlLookupFunctionEntry(). But, in order to get a sane profile, we don't want the profiled thread to run, so we call SuspendThread() to stop it from running so that we can do the stack walk.

So, the profiler's code looks like:

::SuspendThread()
...
mozglue::WalkStackMain64()
    RtlLookupFunctionEntry()

This system mostly works great! There's one snag: the jits have to edit this table list when they allocate or free executable memory. This happens in RegisterExecutableMemory() and DeallocateExecutableMemory(), by calling RtlAddFunctionTable() and RtlDeleteFunctionTable() respectively.

Now we have a problem.

What's going on here is:

The main thread calls Rtl{Add,Delete}FunctionTable(). Since the list is process-wide, it takes a library-internal lock. Then, it gets suspended by the profiler calling SuspendThread(), but it's still holding the lock! So, when the profiler does the stack walk, and calls RtlLookupFunctionEntry(), it cannot get the internal lock, because the main thread holds it. Since the lock is held, the profiling thread blocks, but the main thread is suspended, so it will never run again to release the lock, and we deadlock.

With the SPS addon taking stack samples many times per second, it's only a matter of time.

This issue has existed for a long time. Bug 1154263 was filed over a year ago, and bug 1252587 was filed 3 months ago. Kudos to Brad for dogfooding, finding, and making noise about this. Note that this bug ONLY affects windows 64 bit builds, because it's a deadlock in a win64 specific api. Other OSes and versions do not use this stack-walking mechanism.
Flags: needinfo?(nihsanullah)
Component: JavaScript Engine: JIT → Gecko Profiler
Whiteboard: [fce-active]
Attached patch FixSplinter Review
I'm pretty unhappy with this, but ti'll definitely work.

We don't really need one lock. This means that in practice if *any* thread in JS has that lock, then we skip the profile. What we really want is a lock per thread, and they can synchronize. Unfortunately, I can't figure out where to put that lock, since the JS engine doesn't really have a notion of external threads.

I'm open to any suggestions about how to improve this.
Assignee: nobody → efaustbmo
Status: NEW → ASSIGNED
Attachment #8761864 - Flags: review?(nfroyd)
Attachment #8761864 - Flags: review?(jdemooij)
Comment on attachment 8761864 [details] [diff] [review]
Fix

Brad, this is a slightly modified version of the patch that you said worked for you. If you want to test this again, feel free.
Attachment #8761864 - Flags: feedback?(blassey.bugs)
Attachment #8761864 - Flags: feedback?(blassey.bugs) → feedback+
Comment on attachment 8761864 [details] [diff] [review]
Fix

Review of attachment 8761864 [details] [diff] [review]:
-----------------------------------------------------------------

Eric said he was going to try an alternate approach.  I'm surprised that this touches so many places. :(

No really substantial comments since we're going to have a completely different patch.

::: tools/profiler/core/platform-win32.cc
@@ +203,5 @@
> +      if (!TryAcquireStackWalkWorkaroundLock()) {
> +        ResumeThread(profiled_thread);
> +        return;
> +      }
> +      ReleaseStackWalkWorkaroundLock();

I don't get the logic here, even after suggesting the approach and talking it over.  We attempt to get the lock, and if we can't, we return.  That makes sense.  But then, if we get the lock, we immediately release it without doing any intervening work?  That seems weird.

We're trying to treat the lock as a barrier, then?  But what happens if somebody grabs the lock immediately after we release it?  Why is that not dangerous?
Attachment #8761864 - Flags: review?(nfroyd)
Whiteboard: [fce-active]
Comment on attachment 8761864 [details] [diff] [review]
Fix

Review of attachment 8761864 [details] [diff] [review]:
-----------------------------------------------------------------

Different patch coming up, clearing r?.
Attachment #8761864 - Flags: review?(jdemooij)
(In reply to Nathan Froyd [:froydnj] from comment #19)
> No really substantial comments since we're going to have a completely
> different patch.
> 

Indeed, but replying for posterity.

> We're trying to treat the lock as a barrier, then?
Indeed. 

> But what happens if
> somebody grabs the lock immediately after we release it?  Why is that not
> dangerous?

We can't deadlock with them. We are explicitly trying to catch the case where we suspended the thread that holds the lock. Any other thread that grabs the lock will synchronize, but make progress in the global table. We know that the thread we Suspended can't have the lock because we could get it.

The whole reason I tossed this approach was that if someone else holds the lock, not the thread we suspended, then we might *still* skip the sample, and this seems senseless. We really want to make this 1-1 barriering relationship explicit.
Any chance of movement on this? I still get bit by this pretty often.
The current patch is reasonable; I think Eric's explanation in comment 21 ("We are explicitly trying...") probably deserves to go in the source code.

(In reply to Eric Faust [:efaust] from comment #21)
> The whole reason I tossed this approach was that if someone else holds the
> lock, not the thread we suspended, then we might *still* skip the sample,
> and this seems senseless. We really want to make this 1-1 barriering
> relationship explicit.

That would be desirable, but this is a sampling profiler, after all.  We may just have to accept that there are places we can't take samples.  Can we try the profiler with the current approach and see if we skip too many samples?

ni? to blassey to decide whether we want to go with the current patch or try and do something else as well, since this is needed to make the profiler not awful on win64.
Flags: needinfo?(efaustbmo)
Flags: needinfo?(blassey.bugs)
My preference is to land the fix we have in hand now and file a follow up to try to find a better solution later.
Flags: needinfo?(blassey.bugs)
Sorry for driveby, but since I was on the cc: do you really need a lock or could you just use an Atomic<bool> on the ThreadInfo, such that, after SuspendThread(), you just check the bool and then you set/clear the bool before/after the critical section?  Also, since that's pretty cheap, do you really need to explicitly opt-in each ThreadInfo with canInvokeJS (which sounds like a future-hazard) or can you just have the Atomic<bool> for every thread and most threads simply never set the bool b/c they don't run JS.
Comment on attachment 8761864 [details] [diff] [review]
Fix

Review of attachment 8761864 [details] [diff] [review]:
-----------------------------------------------------------------

This is better than nothing.  Eric is going to address comment 23 before landing.
Attachment #8761864 - Flags: review+
Pushed by efaustbmo@gmail.com:
https://hg.mozilla.org/integration/mozilla-inbound/rev/48694b762a84
Avoid deadlock between the JIT and the gecko profiler on win64. (r=froydnj)
Pushed by efaustbmo@gmail.com:
https://hg.mozilla.org/integration/mozilla-inbound/rev/39da44265946
Avoid deadlock between the JIT and the gecko profiler on win64. (r=froydnj)
Pushed by efaustbmo@gmail.com:
https://hg.mozilla.org/integration/mozilla-inbound/rev/ce1a7322a4a3
Avoid deadlock between the JIT and the gecko profiler on win64. (r=froydnj)
https://hg.mozilla.org/mozilla-central/rev/ce1a7322a4a3
Status: ASSIGNED → RESOLVED
Closed: 8 years ago
Resolution: --- → FIXED
Target Milestone: --- → mozilla52
Depends on: 1304694
Eric has landed this.
Flags: needinfo?(efaustbmo)
Since this has been fixed i been running with the profiler continuously enabled. Today i experienced another hang. See attached windbg log.
I'm also still experiencing hangs with the Profiler enabled. :/ Hey efaust - can you tell if The 8472's windbg log indicates that this bug hasn't been fully fixed, or is this a new hang that's unrelated to this bug?
Flags: needinfo?(efaustbmo)
Another hang, this time it took only minutes after enabling the profiler on a day-old session for the hang to occur.
Status: RESOLVED → REOPENED
Resolution: FIXED → ---
Eric, it looks like this stack walking deadlock is still reproducible.
xul!RegisterExecutableMemory+0x72 [c:\builds\moz2_slave\m-cen-w64-ntly-000000000000000\build\src\js\src\jit\executableallocatorwin.cpp @ 169]:
00000001`8031acba ff1528de1d02    call    qword ptr [xul!_imp_VirtualProtect (00000001`824f8ae8)]
00000001`8031acc0 85c0            test    eax,eax
00000001`8031acc2 7426            je      xul!RegisterExecutableMemory+0xa2 (00000001`8031acea)
00000001`8031acc4 4c8bc3          mov     r8,rbx
00000001`8031acc7 ba01000000      mov     edx,1
00000001`8031accc 488bcb          mov     rcx,rbx
00000001`8031accf ff1593d91d02    call    qword ptr [xul!_imp_RtlAddFunctionTable (00000001`824f8668)]

I bet MOZ_STACKWALKING is undefined in js/.
[Tracking Requested - why for this release]:

It would be good to get this fixed in Firefox 52 or 53 because we plan to roll-out 64-bit Firefox to more people as the default for new installs starting in 52.

I can easily repro this hang on 64-bit Nightly 53.

STR:

1. Install and enable Gecko Profiler: https://github.com/bgirard/Gecko-Profiler-Addon
2. From the Gecko Profiler's toolbar button, click the "Analyze" button.

RESULT:

Hang! Restarting Firefox will hang again (because the Analyze page is open). To recover, I had kill the Firefox process, open my profile in 32-bit Firefox, and disable the Gecko Profiler there.
Copied verbatim from the top-level configure. Untested!
Attachment #8812246 - Flags: review?(mh+mozilla)
Assignee: efaustbmo → dmajor
Attachment #8812246 - Flags: review?(mh+mozilla) → review+
Pushed by dmajor@mozilla.com:
https://hg.mozilla.org/integration/mozilla-inbound/rev/5766462e2be7
followup: define MOZ_STACKWALKING in the JS configure. r=glandium
https://hg.mozilla.org/mozilla-central/rev/5766462e2be7
Status: REOPENED → RESOLVED
Closed: 8 years ago8 years ago
Resolution: --- → FIXED
David, if your stack walking fix doesn't cause any problems on Nightly 53, do you think it is safe uplift it to Aurora 52?

We plan to run an A/B experiment with Firefox 52 release to make x86-64 the default architecture for new installs, so we'd like 52 to include all known 64-bit stability fixes.
Flags: needinfo?(efaustbmo) → needinfo?(dmajor)
Tracking 53- since this is now resolved fixed.
As far as I can tell, the aurora channel doesn't enable profiling, so it shouldn't be able to hit this bug. (mozglue.dll from the latest Win64 m-a doesn't export AcquireStackWalkWorkaroundLock, so I conclude that the #define isn't set.) And I know for sure that beta and release don't enable profiling.
(In reply to David Major [:dmajor] from comment #46)
> As far as I can tell, the aurora channel doesn't enable profiling, so it
> shouldn't be able to hit this bug.

Sounds good. Thanks.
I still hit this bug in a recent Nightly. Maybe it's not exactly this deadlock, but the behaviour is the same; 64-bit Windows, Profiler enabled, when I do something that might trigger a short-ish hang, I get a perma-hang.
If it hangs again, could you grab a backtrace from all the threads? ("~*k" in WinDbg, VS would be fine too but I don't know the command)
Attached file WinDbg output
I was able to attach WinDbg and run ~*k when in this state. I've just attached the output.
Is the above any use?
Flags: needinfo?(dmajor)
Looks like we have four threads digging around in lookup tables, the hang monitor:

  17  Id: 2b8c.1b14 Suspend: 1 Teb: 000007ff`fff90000 Unfrozen
Child-SP          RetAddr           Call Site
00000000`0f42ab78 00000000`77308db8 ntdll!NtWaitForSingleObject+0xa
00000000`0f42ab80 00000000`77308cb4 ntdll!RtlpWaitOnCriticalSection+0xe8
00000000`0f42ac30 00000000`77332498 ntdll!RtlEnterCriticalSection+0xd1
00000000`0f42ac60 00000000`77317c9a ntdll!RtlpLookupDynamicFunctionEntry+0x58
00000000`0f42ac90 000007fe`f99d33be ntdll!RtlLookupFunctionEntry+0xa3
00000000`0f42acc0 000007fe`f99d2e7d mozglue!WalkStackMain64+0x1aa [c:\builds\moz2_slave\m-cen-w64-ntly-000000000000000\build\src\mozglue\misc\stackwalk.cpp @ 391]
00000000`0f42b210 000007fe`dc15a46a mozglue!MozStackWalk+0x17d [c:\builds\moz2_slave\m-cen-w64-ntly-000000000000000\build\src\mozglue\misc\stackwalk.cpp @ 612]
00000000`0f42f330 000007fe`dc1114cd xul!mozilla::HangMonitor::GetChromeHangReport+0xe2 [c:\builds\moz2_slave\m-cen-w64-ntly-000000000000000\build\src\xpcom\threads\hangmonitor.cpp @ 187]
00000000`0f42f8e0 000007fe`ef262fbe xul!mozilla::HangMonitor::ThreadMain+0x33ce65 [c:\builds\moz2_slave\m-cen-w64-ntly-000000000000000\build\src\xpcom\threads\hangmonitor.cpp @ 254]
00000000`0f42f990 000007fe`ef25671a nss3!_PR_NativeRunThread+0xee [c:\builds\moz2_slave\m-cen-w64-ntly-000000000000000\build\src\nsprpub\pr\src\threads\combined\pruthr.c @ 419]
00000000`0f42f9c0 000007fe`f020cab0 nss3!pr_root+0xa [c:\builds\moz2_slave\m-cen-w64-ntly-000000000000000\build\src\nsprpub\pr\src\md\windows\w95thred.c @ 96]
00000000`0f42f9f0 00000000`770b59bd ucrtbase!o__realloc_base+0x60
00000000`0f42fa20 00000000`772ea2e1 kernel32!BaseThreadInitThunk+0xd
00000000`0f42fa50 00000000`00000000 ntdll!RtlUserThreadStart+0x1d

and the main thread:

   0  Id: 2b8c.3204 Suspend: 3 Teb: 000007ff`fffdd000 Unfrozen
Child-SP          RetAddr           Call Site
00000000`00382d00 00000000`77317c9a ntdll!RtlpLookupDynamicFunctionEntry+0x6c
00000000`00382d30 000007fe`f99d33be ntdll!RtlLookupFunctionEntry+0xa3
00000000`00382d60 000007fe`f99d2e7d mozglue!WalkStackMain64+0x1aa [c:\builds\moz2_slave\m-cen-w64-ntly-000000000000000\build\src\mozglue\misc\stackwalk.cpp @ 391]
00000000`003832b0 000007fe`dd376aef mozglue!MozStackWalk+0x17d [c:\builds\moz2_slave\m-cen-w64-ntly-000000000000000\build\src\mozglue\misc\stackwalk.cpp @ 612]
00000000`003873d0 000007fe`dd373862 xul!GeckoSampler::doNativeBacktrace+0x9b [c:\builds\moz2_slave\m-cen-w64-ntly-000000000000000\build\src\tools\profiler\core\geckosampler.cpp @ 1000]
00000000`0038b2c0 000007fe`dd372c5f xul!GeckoSampler::InplaceTick+0x8a [c:\builds\moz2_slave\m-cen-w64-ntly-000000000000000\build\src\tools\profiler\core\geckosampler.cpp @ 1218]
00000000`0038b360 000007fe`dc0ee001 xul!GeckoSampler::GetBacktrace+0xdb [c:\builds\moz2_slave\m-cen-w64-ntly-000000000000000\build\src\tools\profiler\core\geckosampler.cpp @ 1305]
00000000`0038b8f0 000007fe`dbcbac5f xul!mozilla_sampler_get_backtrace+0x39a96d [c:\builds\moz2_slave\m-cen-w64-ntly-000000000000000\build\src\tools\profiler\core\platform.cpp @ 1139]
00000000`0038b920 000007fe`db729828 xul!mozilla::PresShell::ScheduleReflow+0x2b [c:\builds\moz2_slave\m-cen-w64-ntly-000000000000000\build\src\layout\base\presshell.cpp @ 9197]
00000000`0038b950 000007fe`dbb6eaba xul!mozilla::PresShell::FrameNeedsReflow+0x1bc [c:\builds\moz2_slave\m-cen-w64-ntly-000000000000000\build\src\layout\base\presshell.cpp @ 2750]
00000000`0038bb80 000007fe`dbff805b xul!mozilla::PresShell::Initialize+0x21e [c:\builds\moz2_slave\m-cen-w64-ntly-000000000000000\build\src\layout\base\presshell.cpp @ 1795]
00000000`0038bc00 000007fe`dc075144 xul!nsDocumentViewer::InitPresentationStuff+0x6562bf [c:\builds\moz2_slave\m-cen-w64-ntly-000000000000000\build\src\layout\base\nsdocumentviewer.cpp @ 703]
00000000`0038bc70 000007fe`dbb6dfb8 xul!nsDocumentViewer::Show+0x507400 [c:\builds\moz2_slave\m-cen-w64-ntly-000000000000000\build\src\layout\base\nsdocumentviewer.cpp @ 2109]
00000000`0038bd10 000007fe`dbb6bed8 xul!nsDocShell::SetVisibility+0x48 [c:\builds\moz2_slave\m-cen-w64-ntly-000000000000000\build\src\docshell\base\nsdocshell.cpp @ 6421]
00000000`0038bd40 000007fe`dbb6bcea xul!nsFrameLoader::Show+0x1bc [c:\builds\moz2_slave\m-cen-w64-ntly-000000000000000\build\src\dom\base\nsframeloader.cpp @ 1070]
00000000`0038beb0 000007fe`dbb6bc0e xul!nsSubDocumentFrame::ShowViewer+0xb2 [c:\builds\moz2_slave\m-cen-w64-ntly-000000000000000\build\src\layout\generic\nssubdocumentframe.cpp @ 191]
00000000`0038bf00 000007fe`db66f7dc xul!AsyncFrameInit::Run+0x32 [c:\builds\moz2_slave\m-cen-w64-ntly-000000000000000\build\src\layout\generic\nssubdocumentframe.cpp @ 93]
00000000`0038bf30 000007fe`db8e63b2 xul!nsContentUtils::RemoveScriptBlocker+0xec [c:\builds\moz2_slave\m-cen-w64-ntly-000000000000000\build\src\dom\base\nscontentutils.cpp @ 5223]
00000000`0038bf70 000007fe`db9ce76a xul!mozilla::PresShell::FlushPendingNotifications+0x1b2 [c:\builds\moz2_slave\m-cen-w64-ntly-000000000000000\build\src\layout\base\presshell.cpp @ 4122]
00000000`0038c090 000007fe`db8dac8f xul!nsDocument::FlushPendingNotifications+0x11a [c:\builds\moz2_slave\m-cen-w64-ntly-000000000000000\build\src\dom\base\nsdocument.cpp @ 7759]

and the sampler thread:

  47  Id: 2b8c.c94 Suspend: 1 Teb: 000007ff`fff3c000 Unfrozen
Child-SP          RetAddr           Call Site
00000000`39e46a58 00000000`77308db8 ntdll!NtWaitForSingleObject+0xa
00000000`39e46a60 00000000`77308cb4 ntdll!RtlpWaitOnCriticalSection+0xe8
00000000`39e46b10 00000000`77332498 ntdll!RtlEnterCriticalSection+0xd1
00000000`39e46b40 00000000`77317c9a ntdll!RtlpLookupDynamicFunctionEntry+0x58
00000000`39e46b70 000007fe`f99d33be ntdll!RtlLookupFunctionEntry+0xa3
00000000`39e46ba0 000007fe`f99d2e7d mozglue!WalkStackMain64+0x1aa [c:\builds\moz2_slave\m-cen-w64-ntly-000000000000000\build\src\mozglue\misc\stackwalk.cpp @ 391]
00000000`39e470f0 000007fe`dd376aef mozglue!MozStackWalk+0x17d [c:\builds\moz2_slave\m-cen-w64-ntly-000000000000000\build\src\mozglue\misc\stackwalk.cpp @ 612]
00000000`39e4b210 000007fe`dd373862 xul!GeckoSampler::doNativeBacktrace+0x9b [c:\builds\moz2_slave\m-cen-w64-ntly-000000000000000\build\src\tools\profiler\core\geckosampler.cpp @ 1000]
00000000`39e4f100 000007fe`dd36f68d xul!GeckoSampler::InplaceTick+0x8a [c:\builds\moz2_slave\m-cen-w64-ntly-000000000000000\build\src\tools\profiler\core\geckosampler.cpp @ 1218]
00000000`39e4f1a0 000007fe`dd36f49e xul!SamplerThread::SampleContext+0x181 [c:\builds\moz2_slave\m-cen-w64-ntly-000000000000000\build\src\tools\profiler\core\platform-win32.cc @ 250]
00000000`39e4f760 000007fe`dbc473a2 xul!SamplerThread::Run+0xda [c:\builds\moz2_slave\m-cen-w64-ntly-000000000000000\build\src\tools\profiler\core\platform-win32.cc @ 152]
00000000`39e4f7c0 000007fe`f020cab0 xul!`anonymous namespace'::ThreadFunc+0xa [c:\builds\moz2_slave\m-cen-w64-ntly-000000000000000\build\src\ipc\chromium\src\base\platform_thread_win.cc @ 29]
00000000`39e4f7f0 00000000`770b59bd ucrtbase!o__realloc_base+0x60
00000000`39e4f820 00000000`772ea2e1 kernel32!BaseThreadInitThunk+0xd
00000000`39e4f850 00000000`00000000 ntdll!RtlUserThreadStart+0x1d

and a JS thread:

  55  Id: 2b8c.1d00 Suspend: 1 Teb: 000007ff`fff58000 Unfrozen
Child-SP          RetAddr           Call Site
00000000`3281eae8 00000000`77308db8 ntdll!NtWaitForSingleObject+0xa
00000000`3281eaf0 00000000`77308cb4 ntdll!RtlpWaitOnCriticalSection+0xe8
00000000`3281eba0 00000000`773852f0 ntdll!RtlEnterCriticalSection+0xd1
00000000`3281ebd0 000007fe`dbd42a39 ntdll!RtlDeleteFunctionTable+0x20
00000000`3281ec00 000007fe`db891125 xul!js::jit::DeallocateExecutableMemory+0x25 [c:\builds\moz2_slave\m-cen-w64-ntly-000000000000000\build\src\js\src\jit\executableallocatorwin.cpp @ 243]
00000000`3281ec30 000007fe`db89106e xul!js::jit::ExecutableAllocator::releasePoolPages+0x35 [c:\builds\moz2_slave\m-cen-w64-ntly-000000000000000\build\src\js\src\jit\executableallocator.cpp @ 286]
00000000`3281ec80 000007fe`db88e427 xul!js::jit::ExecutableAllocator::poisonCode+0x10e [c:\builds\moz2_slave\m-cen-w64-ntly-000000000000000\build\src\js\src\jit\executableallocator.cpp @ 407]
00000000`3281ecd0 000007fe`db861a37 xul!js::FreeOp::~FreeOp+0x5b [c:\builds\moz2_slave\m-cen-w64-ntly-000000000000000\build\src\js\src\vm\runtime.cpp @ 713]
00000000`3281ed00 000007fe`db861112 xul!js::gc::GCRuntime::sweepPhase+0x35f [c:\builds\moz2_slave\m-cen-w64-ntly-000000000000000\build\src\js\src\jsgc.cpp @ 5398]
00000000`3281ee50 000007fe`db9466f7 xul!js::gc::GCRuntime::incrementalCollectSlice+0x1da [c:\builds\moz2_slave\m-cen-w64-ntly-000000000000000\build\src\js\src\jsgc.cpp @ 5907]
00000000`3281efc0 000007fe`db9463c0 xul!js::gc::GCRuntime::gcCycle+0x1ab [c:\builds\moz2_slave\m-cen-w64-ntly-000000000000000\build\src\js\src\jsgc.cpp @ 6193]
00000000`3281f0d0 000007fe`db9f4841 xul!js::gc::GCRuntime::collect+0xc0 [c:\builds\moz2_slave\m-cen-w64-ntly-000000000000000\build\src\js\src\jsgc.cpp @ 6339]
00000000`3281f1b0 000007fe`db9f473d xul!js::gc::GCRuntime::gc+0x41 [c:\builds\moz2_slave\m-cen-w64-ntly-000000000000000\build\src\js\src\jsgc.cpp @ 6399]
00000000`3281f200 000007fe`db9f4697 xul!mozilla::dom::workers::WorkerPrivate::GarbageCollectInternal+0x9d [c:\builds\moz2_slave\m-cen-w64-ntly-000000000000000\build\src\dom\workers\workerprivate.cpp @ 6269]
00000000`3281f230 000007fe`db9493ad xul!`anonymous namespace'::GarbageCollectRunnable::WorkerRun+0x17 [c:\builds\moz2_slave\m-cen-w64-ntly-000000000000000\build\src\dom\workers\workerprivate.cpp @ 1395]
00000000`3281f260 000007fe`db8a5e2d xul!mozilla::dom::workers::WorkerRunnable::Run+0x129 [c:\builds\moz2_slave\m-cen-w64-ntly-000000000000000\build\src\dom\workers\workerrunnable.cpp @ 376]
00000000`3281f410 000007fe`db8a5a3b xul!mozilla::dom::workers::WorkerPrivate::ProcessAllControlRunnablesLocked+0x51 [c:\builds\moz2_slave\m-cen-w64-ntly-000000000000000\build\src\dom\workers\workerprivate.cpp @ 5055]
00000000`3281f450 000007fe`dbc495e7 xul!mozilla::dom::workers::WorkerPrivate::DoRunLoop+0x123 [c:\builds\moz2_slave\m-cen-w64-ntly-000000000000000\build\src\dom\workers\workerprivate.cpp @ 4577]
00000000`3281f580 000007fe`db68c3a3 xul!`anonymous namespace'::WorkerThreadPrimaryRunnable::Run+0x16f [c:\builds\moz2_slave\m-cen-w64-ntly-000000000000000\build\src\dom\workers\runtimeservice.cpp @ 2876]
00000000`3281f8b0 000007fe`db8a6152 xul!nsThread::ProcessNextEvent+0x26b [c:\builds\moz2_slave\m-cen-w64-ntly-000000000000000\build\src\xpcom\threads\nsthread.cpp @ 1213]

(LazyIdleThread is dealing with locks, but I don't think that particular one concerns us?)

So these conditions are kind of like what the patch was originally supposed to solve...except that it's clearly not...
In the previous log at comment 37, the suspended main thread held the function-table lock because it was inside RegisterExecutableMemory. That case ought to be fixed now.

In the log at comment 52, the suspended main thread holds the function-table lock because ScheduleReflow/AddLayoutFlushObserver wants a backtrace. 

Naively, it seems to me that _any_ operation on the main thread that touches function-tables needs to be behind AcquireStackWalkWorkaroundLock. Not knowing the architecture here (I just stepped in to help with the #ifdefs!), I can't tell if we're just missing a one-off lock in the PresShell backtrace stuff, or if there's a larger problem that needs to be addressed more generally? Is there anything else that the main thread could do to end up in this situation?
Flags: needinfo?(dmajor)
(In reply to David Major [:dmajor] from comment #53)
> In the log at comment 52, the suspended main thread holds the function-table
> lock because ScheduleReflow/AddLayoutFlushObserver wants a backtrace. 

This was added in bug 926922 by BenWa. ehsan was one of the reviewers there

> (I just stepped in to help with the #ifdefs!),

I know the feeling. :) Thanks for your work on this so far, regardless.

> I can't tell if we're just missing a one-off lock in the PresShell backtrace
> stuff, or if there's a larger problem that needs to be addressed more
> generally? Is there anything else that the main thread could do to end up in
> this situation?

So I did a searchfox search for profiler_get_backtrace. Manually inspecting, here are a few more places this could happen:

http://searchfox.org/mozilla-central/rev/36bfd0a8da5490274ca49c100957244253845e52/tools/profiler/gecko/ProfilerIOInterposeObserver.cpp#17
http://searchfox.org/mozilla-central/rev/36bfd0a8da5490274ca49c100957244253845e52/tools/profiler/core/ProfilerMarkers.cpp#60
http://searchfox.org/mozilla-central/rev/36bfd0a8da5490274ca49c100957244253845e52/layout/base/nsRefreshDriver.h#164
http://searchfox.org/mozilla-central/rev/36bfd0a8da5490274ca49c100957244253845e52/layout/base/RestyleTracker.h#270

Perhaps we should modify profiler_get_backtrace, or one of the things it calls, to try to acquire the workaround lock?

I'm really uncertain how to proceed here, but as we slowly start shifting towards 64-bit by default, I can see this becoming more and more of an issue.

I'm going to open a new bug for this, and see if I can make sure the folks driving Win 64 know what's up here.
See Also: → 1322735
Depends on: 1336877
You need to log in before you can comment on or make changes to this bug.

Attachment

General

Created:
Updated:
Size: