Closed Bug 1120126 Opened 7 years ago Closed 7 years ago
crash in Rtl
Virtual Unwind | Rtlpx Lookup Function Table | alloca _probe
This bug was filed from the Socorro interface and is report bp-b9e2e8b8-6969-4c1a-aefd-1c8aa2150110. ============================================================= Startup crash on Win64 nightly. I have the profiler addon installed and turned on, and I see profiler stuff calling into NS_StackWalk. Presumably fallout from bug 1088343.
I'm inexperienced at debugging on Windows, but here's what I've found. - I can reproduce locally, which is good. Just start a Win64 build (debug or opt) with the profiling add-on installed. - It's notable that DMD is working fine, because it wants to walk the current stack and so has to use the auxiliary thread to do so. In contrast, the profiler wants to walk another thread's stack. So it's probably related to that somehow. - The crash report's stack trace is weird: > 0 ntdll.dll RtlVirtualUnwind > 1 ntdll.dll RtlpxLookupFunctionTable > 2 d3d11.dll alloca_probe > 3 kernel32.dll RtlVirtualUnwindStub > 4 d3d11.dll CSampler::GetDesc(D3D11_SAMPLER_DESC*) > 5 xul.dll WalkStackMain64 xpcom/base/nsStackWalk.cpp > 6 xul.dll NS_StackWalk xpcom/base/nsStackWalk.cpp WalkStackMain64 calls RtlVirtualUnwind directly, so I don't know why frames 1--4 are present and what they mean. - Nonetheless, it looks like RtlVirtualUnwind is crashing, which seems quite uncool for a system function. - Locally, Visual Studio gives this stack trace for the crash: > ntdll.dll > kernel32.dll > WalkStackMain64 > NS_StackWalk which looks more like what I'd expect. - Visual Studio is also unable to access the memory pointed to by |aData| within WalkStackMain64. I don't know why that is; it's just a local variable struct whose address is passed in from NS_StackWalk, and it can access it from the frame below. - The number of stack traces obtained before crashing varies. I've seen 0, 4, 5, 11, 24, 25.
Any known issues with Socorro's Win64 stack walker? VS and WinDbg give correct results.
The disassembly before the crash (in comment 0 at least) is: 00007ffe`9a1d9a5f 410fb7546e04 movzx edx,word ptr [r14+rbp*2+4] 00007ffe`9a1d9a65 c1e203 shl edx,3 00007ffe`9a1d9a68 4903d3 add rdx,r11 00007ffe`9a1d9a6b 488b02 mov rax,qword ptr [rdx] I've checked that @rax has the value you would expect, in other words we didn't jump into the middle of this sequence. So @r14 is the "interesting" register and the rest is just a little manipulation. 0:073> !address @r14 [snip] Type: 01000000 MEM_IMAGE Image Path: C:\Windows\System32\d3d11.dll Busted unwind data from d3d11? Either that, or busted unwind data in our code that caused the unwinder to deref something in d3d.
I'm able to reproduce this. In a handful of attempts I've mostly crashed at another part of RtlVirtualUnwind. I did once crash at the same place at comment 0, and in that case my "interesting register" was inside xul.dll memory. So I think the d3d thing is a red herring and it's something in our data structures that sends the walker astray. Since the crash site isn't consistent, I'm going to see if I can rig up a poor man's "rr" with a VM snapshot. > I've checked that @rax has the value you would expect (that should have said @rdx)
In my crash at comment 4, the profiler was trying to walk the main thread, which had a thoroughly boring stack: mostly CSS restyle stuff, no JIT or external binaries etc.
This does the trick. I found this through trial and error -- WalkStackMain64 uses CONTEXT_FULL when a context isn't provided from above so I just tried doing that here and it worked. It's possible that we could use something other than CONTEXT_FULL, but the online docs for CONTEXT (http://msdn.microsoft.com/en-us/library/windows/desktop/ms679284%28v=vs.85%29.aspx) just say "look at WinNT.h" and supposedly that file is somewhere in a MSVC installation but I sure can't find it. And since we use CONTEXT_FULL in the not-provided-from-above case it seems reasonable to use it here, too.
Assignee: nobody → n.nethercote
Status: NEW → ASSIGNED
It looks like `CONTEXT_CONTROL | CONTEXT_INTEGER` can be used instead of CONTEXT_FULL, but CONTEXT_FULL is probably better for consistency with other places.
Comment on attachment 8547864 [details] [diff] [review] Fix crash in RtlVirtualUnwind when starting the Gecko profiler on Win64 Review of attachment 8547864 [details] [diff] [review]: ----------------------------------------------------------------- Looking at my winnt.h, CONTEXT_FULL sounds reasonable. I was going to suggest that we could just use CONTEXT_FULL everywhere, but it looks like bug 816117 deliberately chose otherwise for perf reasons. It may be helpful to comment that x64 needs FULL for correctness and x86 needs CONTROL for performance.
Attachment #8547864 - Flags: review?(dmajor) → review+
Status: ASSIGNED → RESOLVED
Closed: 7 years ago
Resolution: --- → FIXED
Target Milestone: --- → mozilla38
(In reply to David Major [:dmajor] (UTC+13) from comment #2) > Any known issues with Socorro's Win64 stack walker? VS and WinDbg give > correct results. Sorry, I've had this open in a tab forever and keep forgetting to look into it. If you look at the symbol file for the ntdll.dll from the top frame: http://symbols.mozilla.org/ntdll.pdb/90C5F8ECD0194F088E74C5F6F935C1E91/ntdll.sym you'll note a lack of STACK WIN records. The binary we're using to fetch missing symbols from Microsoft's symbol server isn't fetching the matching .dll file, and Breakpad's dump_syms needs the .dll to get the unwind data for Win64 binaries. :-( I filed bug 1140358 on this.
You need to log in before you can comment on or make changes to this bug.