Closed Bug 1120126 Opened 6 years ago Closed 6 years ago

crash in RtlVirtualUnwind | RtlpxLookupFunctionTable | alloca_probe

Categories

(Core :: XPCOM, defect)

x86_64
Windows 8.1
defect
Not set
critical

Tracking

()

RESOLVED FIXED
mozilla38

People

(Reporter: aklotz, Assigned: njn)

References

Details

(Keywords: crash)

Crash Data

Attachments

(1 file)

This bug was filed from the Socorro interface and is 
report bp-b9e2e8b8-6969-4c1a-aefd-1c8aa2150110.
=============================================================

Startup crash on Win64 nightly. I have the profiler addon installed and turned on, and I see profiler stuff calling into NS_StackWalk. Presumably fallout from bug 1088343.
Flags: needinfo?(n.nethercote)
I'm inexperienced at debugging on Windows, but here's what I've found.

- I can reproduce locally, which is good. Just start a Win64 build (debug or opt) with the profiling add-on installed.

- It's notable that DMD is working fine, because it wants to walk the current stack and so has to use the auxiliary thread to do so. In contrast, the profiler wants to walk another thread's stack. So it's probably related to that somehow.

- The crash report's stack trace is weird:

> 0 ntdll.dll     RtlVirtualUnwind 	
> 1 ntdll.dll     RtlpxLookupFunctionTable 	
> 2 d3d11.dll     alloca_probe 	
> 3 kernel32.dll  RtlVirtualUnwindStub 	
> 4 d3d11.dll     CSampler::GetDesc(D3D11_SAMPLER_DESC*) 	
> 5 xul.dll       WalkStackMain64       xpcom/base/nsStackWalk.cpp
> 6 xul.dll       NS_StackWalk          xpcom/base/nsStackWalk.cpp

WalkStackMain64 calls RtlVirtualUnwind directly, so I don't know why frames 1--4 are present and what they mean.

- Nonetheless, it looks like RtlVirtualUnwind is crashing, which seems quite uncool for a system function.

- Locally, Visual Studio gives this stack trace for the crash:

> ntdll.dll
> kernel32.dll
> WalkStackMain64
> NS_StackWalk

which looks more like what I'd expect.

- Visual Studio is also unable to access the memory pointed to by |aData| within WalkStackMain64. I don't know why that is; it's just a local variable struct whose address is passed in from NS_StackWalk, and it can access it from the frame below.

- The number of stack traces obtained before crashing varies. I've seen 0, 4, 5, 11, 24, 25.
Flags: needinfo?(n.nethercote)
Any known issues with Socorro's Win64 stack walker? VS and WinDbg give correct results.
Flags: needinfo?(ted)
The disassembly before the crash (in comment 0 at least) is:

00007ffe`9a1d9a5f 410fb7546e04    movzx   edx,word ptr [r14+rbp*2+4]
00007ffe`9a1d9a65 c1e203          shl     edx,3
00007ffe`9a1d9a68 4903d3          add     rdx,r11
00007ffe`9a1d9a6b 488b02          mov     rax,qword ptr [rdx]

I've checked that @rax has the value you would expect, in other words we didn't jump into the middle of this sequence. So @r14 is the "interesting" register and the rest is just a little manipulation.

0:073> !address @r14
[snip]
Type:                   01000000	MEM_IMAGE
Image Path:             C:\Windows\System32\d3d11.dll

Busted unwind data from d3d11? Either that, or busted unwind data in our code that caused the unwinder to deref something in d3d.
I'm able to reproduce this. In a handful of attempts I've mostly crashed at another part of RtlVirtualUnwind. I did once crash at the same place at comment 0, and in that case my "interesting register" was inside xul.dll memory. So I think the d3d thing is a red herring and it's something in our data structures that sends the walker astray.

Since the crash site isn't consistent, I'm going to see if I can rig up a poor man's "rr" with a VM snapshot.

> I've checked that @rax has the value you would expect
(that should have said @rdx)
In my crash at comment 4, the profiler was trying to walk the main thread, which had a thoroughly boring stack: mostly CSS restyle stuff, no JIT or external binaries etc.
This does the trick. I found this through trial and error -- WalkStackMain64
uses CONTEXT_FULL when a context isn't provided from above so I just tried
doing that here and it worked.

It's possible that we could use something other than CONTEXT_FULL, but the
online docs for CONTEXT
(http://msdn.microsoft.com/en-us/library/windows/desktop/ms679284%28v=vs.85%29.aspx)
just say "look at WinNT.h" and supposedly that file is somewhere in a MSVC
installation but I sure can't find it. And since we use CONTEXT_FULL in the
not-provided-from-above case it seems reasonable to use it here, too.
Attachment #8547864 - Flags: review?(dmajor)
Attachment #8547864 - Flags: feedback?(bgirard)
Assignee: nobody → n.nethercote
Status: NEW → ASSIGNED
It looks like `CONTEXT_CONTROL | CONTEXT_INTEGER` can be used instead of CONTEXT_FULL, but CONTEXT_FULL is probably better for consistency with other places.
Comment on attachment 8547864 [details] [diff] [review]
Fix crash in RtlVirtualUnwind when starting the Gecko profiler on Win64

Review of attachment 8547864 [details] [diff] [review]:
-----------------------------------------------------------------

Looking at my winnt.h, CONTEXT_FULL sounds reasonable.

I was going to suggest that we could just use CONTEXT_FULL everywhere, but it looks like bug 816117 deliberately chose otherwise for perf reasons. It may be helpful to comment that x64 needs FULL for correctness and x86 needs CONTROL for performance.
Attachment #8547864 - Flags: review?(dmajor) → review+
Attachment #8547864 - Flags: feedback?(bgirard)
https://hg.mozilla.org/mozilla-central/rev/06f5318758ae
Status: ASSIGNED → RESOLVED
Closed: 6 years ago
Resolution: --- → FIXED
Target Milestone: --- → mozilla38
Duplicate of this bug: 1121407
Blocks: 1121571
(In reply to David Major [:dmajor] (UTC+13) from comment #2)
> Any known issues with Socorro's Win64 stack walker? VS and WinDbg give
> correct results.

Sorry, I've had this open in a tab forever and keep forgetting to look into it. If you look at the symbol file for the ntdll.dll from the top frame:
http://symbols.mozilla.org/ntdll.pdb/90C5F8ECD0194F088E74C5F6F935C1E91/ntdll.sym

you'll note a lack of STACK WIN records. The binary we're using to fetch missing symbols from Microsoft's symbol server isn't fetching the matching .dll file, and Breakpad's dump_syms needs the .dll to get the unwind data for Win64 binaries. :-( I filed bug 1140358 on this.
Flags: needinfo?(ted)
You need to log in before you can comment on or make changes to this bug.