Closed Bug 1088343 Opened 5 years ago Closed 5 years ago

Get DMD working on 64-bit Windows builds (stack walking doesn't work)

Categories

(Core :: DMD, defect)

x86_64
Windows 8
defect
Not set

Tracking

()

RESOLVED FIXED
mozilla37

People

(Reporter: njn, Assigned: njn)

References

Details

Attachments

(3 files, 1 obsolete file)

DMD currently doesn't work on the 64-bit Windows 8 test machines because every attempt to do stack-walking results in:

> ERROR: WalkStack64: The handle is invalid.
Once this is working, test_dmd.js needs to be enabled.
StackWalk64() is failing. The docs for it are here: http://msdn.microsoft.com/en-us/library/windows/desktop/ms680650%28v=vs.85%29.aspx

Unfortunately it just returns a bool to indicate success/failure, and on failure it doesn't set any kind of helpful error code :(
> > ERROR: WalkStack64: The handle is invalid.

Turns out error messagethis is a red herring -- it occurs on 32-bit Windows too,
but things work correctly there anyway.

On 64-bit, we are getting some stack walking happening. NS_StackWalk normally
skips the first few frames in a stack trace. If I disable that skipping, I see
lots of stack traces like this on 64-bit:

> #01: NtSignalAndWaitForSingleObject[ntdll +0x97eaa]
> #02: SignalObjectAndWait[KERNELBASE +0x85ebe]
> #03: ??? (???:???)
> #04: CreateEventA[KERNELBASE +0x196c]

In comparison, on 32-bit with skipping disabled I see stack traces like this:

> #01: ZwSignalAndWaitForSingleObject[ntdll +0x3d52c]
> #02: NS_StackWalk (c:\\users\\nicholas\\moz\\mi1\\xpcom\\base\\nsstackwalk.cpp:605)
> #03: mozilla::dmd::StackTrace::Get (c:\\users\\nicholas\\moz\\mi1\\memory\\replace\\dmd\\dmd.cpp:673)
> #04: mozilla::dmd::AllocCallback (c:\\users\\nicholas\\moz\\mi1\\memory\\replace\\dmd\\dmd.cpp:983)
> #05: replace_malloc (c:\\users\\nicholas\\moz\\mi1\\memory\\replace\\dmd\\dmd.cpp:1044)
> #06: RunTests (c:\\users\\nicholas\\moz\\mi1\\memory\\replace\\dmd\\test\\smokedmd.cpp:130)
> #07: main (c:\\users\\nicholas\\moz\\mi1\\memory\\replace\\dmd\\test\\smokedmd.cpp:324)
> #08: __tmainCRTStartup (f:\\dd\\vctools\\crt\\crtw32\\dllstuff\\crtexe.c:626)
> #09: BaseThreadInitThunk[KERNEL32 +0x1495d]
> #10: RtlInitializeExceptionChain[ntdll +0x498ee]
> #11: RtlInitializeExceptionChain[ntdll +0x498c4]

It almost looks like the stack trace we get on 64-bit is from the wrong thread? Not sure.

> * @param aPlatformData Platform specific data that can help in walking the
> *                      stack, this should be nullptr unless you really know
> *                      what you're doing! This needs to be a pointer to a
> *                      CONTEXT on Windows and should not be passed on other
> *                      platforms.
Another thing: NS_StackWalk() has this parameter:

> > * @param aPlatformData Platform specific data that can help in walking the
> > *                      stack, this should be nullptr unless you really know
> > *                      what you're doing! This needs to be a pointer to a
> > *                      CONTEXT on Windows and should not be passed on other
> > *                      platforms.

In DMD we just use nullptr, but I see that SPS sets this to a CONTEXT. Perhaps that's a problem? Though it doesn't seem to hurt on 32-bit...
dmajor requested steps to reproduce this:

- Do a --enable-debug --enable-dmd --enable-profiling 64-bit build.

- cd into $OBJDIR/dist/bin/

- Run |DMD=1 MOZ_REPLACE_MALLOC_LIB=dmd.dll SmokeDMD.exe|

If you want to debug, NS_StackWalk() and WalkStackThread() are the interesting functions to set breakpoints on, with the latter being the one that actually calls StackWalk64().
> On 64-bit, we are getting some stack walking happening. NS_StackWalk normally
> skips the first few frames in a stack trace. If I disable that skipping, I
> see
> lots of stack traces like this on 64-bit:
> 
> > #01: NtSignalAndWaitForSingleObject[ntdll +0x97eaa]
> > #02: SignalObjectAndWait[KERNELBASE +0x85ebe]
> > #03: ??? (???:???)
> > #04: CreateEventA[KERNELBASE +0x196c]
> 
> It almost looks like the stack trace we get on 64-bit is from the wrong
> thread? Not sure.

I think it's the right thread, but it gets confused at frame #03. My debugger says this is the stack for thread 0 (the thread that we're interested in):

ntdll!ZwSignalAndWaitForSingleObject
kernel32!SignalObjectAndWait
dmd!NS_StackWalk
dmd!mozilla::dmd::StackTrace::Get
dmd!mozilla::dmd::AllocCallback
dmd!replace_malloc
SmokeDMD!RunTests
SmokeDMD!main
SmokeDMD!__tmainCRTStartup
kernel32!BaseThreadInitThunk
ntdll!RtlUserThreadStart

StackWalk64 gets the first two frames right, but on the third call, the |addr| is 0x4. (Makes sense that later results aren't trustworthy). Perhaps tracing back the |addr| and |spaddr| results, and comparing versus the 'good' frames, can help figure out where things went wrong?
> StackWalk64 gets the first two frames right, but on the third call, the
> |addr| is 0x4. (Makes sense that later results aren't trustworthy). Perhaps
> tracing back the |addr| and |spaddr| results, and comparing versus the
> 'good' frames, can help figure out where things went wrong?

On the first two frames both |addr| and |spaddr| look reasonable, i.e. pointer values. On the third frame |spaddr| still looks reasonable, but I get values like 0x48 for |addr|, similar to your 0x4. On the fourth frame they both look reasonable but things are probably unreliable by that point.
OS: Linux → Windows 8
Summary: Get DMD working on 64-bit Windows builds → Get DMD working on 64-bit Windows builds (stack walking doesn't work)
FYI, support for stack walking on Win64 was initially added in https://hg.mozilla.org/mozilla-central/rev/a8f3ae861697 , and was perhaps significantly refactored by https://hg.mozilla.org/mozilla-central/rev/3aa3c980b5ec .
I only changed hand-written dynamic loading to /delayload:dbghelp.dll and removed the legacy code path which was not used on Win64 anyway.
Duplicate of this bug: 1107581
Using RtlVirtualUnwind() might help here (http://msdn.microsoft.com/en-us/library/ms680617%28VS.85%29.aspx).
Since bug 1107581 wasn't necessarily Win64-specific, note that RtlLookupFunctionEntry/RtlVirtualUnwind do not exist on 32-bit Windows.
dmajor, ehsan: feel free to give feedback, though I won't necessarily wait on
an f+ from both of you.

This appears to work well.

- DMD is working -- the tests pass locally and on try, and a dump done in a
  full browser run looks good.

- Stacks printed on assertion failures in mochitests look pretty good, both
  locally and on try. If the stack has numerous ??? entries we don't trace
  through it when the StackWalk64() version does (on 32-bit).
Attachment #8545796 - Flags: review?(aklotz)
Attachment #8545796 - Flags: feedback?(ehsan)
Attachment #8545796 - Flags: feedback?(dmajor)
Attachment #8545797 - Flags: review?(mh+mozilla)
Try run: https://treeherder.mozilla.org/#/jobs?repo=try&revision=d5ad5746c5dd

> If the stack has numerous ??? entries we don't trace
> through it when the StackWalk64() version does (on 32-bit).

For more details about this, here's a Win32 log for M-o:

http://ftp.mozilla.org/pub/mozilla.org/firefox/try-builds/nnethercote@mozilla.com-d5ad5746c5dd/try-win32-debug/try_win7-ix-debug_test-mochitest-other-bm109-tests1-windows-build3459.txt.gz

And here's one for Win64 with the patches applied:

http://ftp.mozilla.org/pub/mozilla.org/firefox/try-builds/nnethercote@mozilla.com-d5ad5746c5dd/try-win64-debug/try_win8_64-debug_test-mochitest-other-bm110-tests1-windows-build349.txt.gz

In the former, the stack trace just under the first "too-big.com" assertion failure looks like this:

> 00:14:19     INFO -  [Parent 2976] ###!!! ASSERTION: Invalid value (157286400 / 102760448) for explicit/js/compartment(http://too-big.com/)/stuff: 'false', file aboutMemory.js, line 0
> 00:14:19     INFO -  #01: NS_InvokeByIndex [xpcom/reflect/xptcall/md/win32/xptcinvoke.cpp:71]
> 00:14:19     INFO -  #02: CallMethodHelper::Invoke() [js/xpconnect/src/XPCWrappedNative.cpp:2384]
> 00:14:19     INFO -  #03: XPCWrappedNative::CallMethod(XPCCallContext &,XPCWrappedNative::CallMode) [js/xpconnect/src/XPCWrappedNative.cpp:1703]
> 00:14:19     INFO -  #04: XPC_WN_CallMethod(JSContext *,unsigned int,JS::Value *) [js/xpconnect/src/XPCWrappedNativeJSOps.cpp:1174]
> 00:14:19     INFO -  #05: js::CallJSNative(JSContext *,bool (*)(JSContext *,unsigned int,JS::Value *),JS::CallArgs const &) [js/src/jscntxtinlines.h:229]
> 00:14:19     INFO -  #06: js::Invoke(JSContext *,JS::CallArgs,js::MaybeConstruct) [js/src/vm/Interpreter.cpp:502]
> 00:14:19     INFO -  #07: Interpret [js/src/vm/Interpreter.cpp:2560]
> 00:14:19     INFO -  #08: js::RunScript(JSContext *,js::RunState &) [js/src/vm/Interpreter.cpp:452]
> 00:14:19     INFO -  #09: js::Invoke(JSContext *,JS::CallArgs,js::MaybeConstruct) [js/src/vm/Interpreter.cpp:521]
> 00:14:19     INFO -  #10: js::Invoke(JSContext *,JS::Value const &,JS::Value const &,unsigned int,JS::Value const *,JS::MutableHandle<JS::Value>) [js/src/vm/Interpreter.cpp:558]
> 00:14:19     INFO -  #11: js::jit::DoCallFallback [js/src/jit/BaselineIC.cpp:9498]
> 00:14:19     INFO -  #12: ??? (???:???)
> 00:14:19     INFO -  #13: ??? (???:???)
> 00:14:19     INFO -  #14: ??? (???:???)
> 00:14:19     INFO -  #15: ??? (???:???)
> 00:14:19     INFO -  #16: ??? (???:???)
> 00:14:19     INFO -  #17: ??? (???:???)
> 00:14:19     INFO -  #18: ??? (???:???)
> 00:14:19     INFO -  #19: EnterBaseline [js/src/jit/BaselineJIT.cpp:125]
> 00:14:19     INFO -  #20: js::jit::EnterBaselineMethod(JSContext *,js::RunState &) [js/src/jit/BaselineJIT.cpp:154]
> 00:14:19     INFO -  #21: js::RunScript(JSContext *,js::RunState &) [js/src/vm/Interpreter.cpp:442]
> 00:14:19     INFO -  #22: js::Invoke(JSContext *,JS::CallArgs,js::MaybeConstruct) [js/src/vm/Interpreter.cpp:521]
> 00:14:19     INFO -  #23: js::Invoke(JSContext *,JS::Value const &,JS::Value const &,unsigned int,JS::Value const *,JS::MutableHandle<JS::Value>) [js/src/vm/Interpreter.cpp:558]
> 00:14:19     INFO -  #24: js::jit::DoCallFallback [js/src/jit/BaselineIC.cpp:9498]
> 00:14:19     INFO -  #25: ??? (???:???)
> 00:14:19     INFO -  #26: ??? (???:???)
> 00:14:19     INFO -  #27: ??? (???:???)
> 00:14:19     INFO -  #28: EnterBaseline [js/src/jit/BaselineJIT.cpp:125]
> 00:14:19     INFO -  #29: js::jit::EnterBaselineMethod(JSContext *,js::RunState &) [js/src/jit/BaselineJIT.cpp:154]
> 00:14:19     INFO -  #30: Interpret [js/src/vm/Interpreter.cpp:3503]
> 00:14:19     INFO -  #31: js::RunScript(JSContext *,js::RunState &) [js/src/vm/Interpreter.cpp:452]
> 00:14:19     INFO -  #32: js::Invoke(JSContext *,JS::CallArgs,js::MaybeConstruct) [js/src/vm/Interpreter.cpp:521]
> 00:14:19     INFO -  #33: js::Invoke(JSContext *,JS::Value const &,JS::Value const &,unsigned int,JS::Value const *,JS::MutableHandle<JS::Value>) [js/src/vm/Interpreter.cpp:558]
> 00:14:19     INFO -  #34: JS_CallFunctionValue(JSContext *,JS::Handle<JSObject *>,JS::Handle<JS::Value>,JS::HandleValueArray const &,JS::MutableHandle<JS::Value>) [js/src/jsapi.cpp:4547]
> 00:14:19     INFO -  #35: nsXPCWrappedJSClass::CallMethod(nsXPCWrappedJS *,unsigned short,XPTMethodDescriptor const *,nsXPTCMiniVariant *) [js/xpconnect/src/XPCWrappedJSClass.cpp:1206]
> 00:14:19     INFO -  #36: nsXPCWrappedJS::CallMethod(unsigned short,XPTMethodDescriptor const *,nsXPTCMiniVariant *) [js/xpconnect/src/XPCWrappedJS.cpp:533]
> 00:14:19     INFO -  #37: PrepareAndDispatch [xpcom/reflect/xptcall/md/win32/xptcstubs.cpp:85]
> 00:14:19     INFO -  #38: SharedStub [xpcom/reflect/xptcall/md/win32/xptcstubs.cpp:113]
> 00:14:19     INFO -  #39: nsMemoryReporterManager::FinishReporting() [xpcom/base/nsMemoryReporterManager.cpp:1395]
> 00:14:19     INFO -  #40: nsMemoryReporterManager::GetReportsExtended(nsIMemoryReporterCallback *,nsISupports *,nsIFinishReportingCallback *,nsISupports *,bool,bool,nsAString_internal const &) [xpcom/base/nsMemoryReporterManager.cpp:1186]

and so on for another 61 frames down to below XRE_main().

In the latter, it looks like this:

> 00:45:18     INFO -  [Parent 2668] ###!!! ASSERTION: Invalid value (157286400 / 102760448) for explicit/js/compartment(http://too-big.com/)/stuff: 'false', file aboutMemory.js, line 0
> 00:45:18     INFO -  #01: XPTC__InvokebyIndex [xpcom/reflect/xptcall/md/win32/xptcinvoke_asm_x86_64.asm:99]
> 00:45:18     INFO -  #02: CallMethodHelper::Call() [js/xpconnect/src/XPCWrappedNative.cpp:1738]
> 00:45:18     INFO -  #03: XPCWrappedNative::CallMethod(XPCCallContext &,XPCWrappedNative::CallMode) [js/xpconnect/src/XPCWrappedNative.cpp:1703]
> 00:45:18     INFO -  #04: XPC_WN_CallMethod(JSContext *,unsigned int,JS::Value *) [js/xpconnect/src/XPCWrappedNativeJSOps.cpp:1174]
> 00:45:18     INFO -  #05: js::CallJSNative(JSContext *,bool (*)(JSContext *,unsigned int,JS::Value *),JS::CallArgs const &) [js/src/jscntxtinlines.h:229]
> 00:45:18     INFO -  #06: js::Invoke(JSContext *,JS::CallArgs,js::MaybeConstruct) [js/src/vm/Interpreter.cpp:502]
> 00:45:18     INFO -  #07: Interpret [js/src/vm/Interpreter.cpp:2560]
> 00:45:18     INFO -  #08: js::RunScript(JSContext *,js::RunState &) [js/src/vm/Interpreter.cpp:452]
> 00:45:18     INFO -  #09: js::Invoke(JSContext *,JS::CallArgs,js::MaybeConstruct) [js/src/vm/Interpreter.cpp:521]
> 00:45:18     INFO -  #10: js::Invoke(JSContext *,JS::Value const &,JS::Value const &,unsigned int,JS::Value const *,JS::MutableHandle<JS::Value>) [js/src/vm/Interpreter.cpp:558]
> 00:45:18     INFO -  #11: js::jit::DoCallFallback [js/src/jit/BaselineIC.cpp:9497]
> 00:45:18     INFO -  #12: ??? (???:???)
> 00:45:18     INFO -  #13: ??? (???:???)

So it seems to be fooled by the JIT frames. But maybe that's hard to avoid on 64-bit because you need stack unwinding info which the JITs don't provide? Whatever the reason, it's a hell of a lot better than the garbage we were getting before.
Comment on attachment 8545796 [details] [diff] [review]
(part 1) - Get NS_StackWalk() working on Win64

Review of attachment 8545796 [details] [diff] [review]:
-----------------------------------------------------------------

::: xpcom/base/nsStackWalk.cpp
@@ +454,5 @@
> +    context = *static_cast<CONTEXT*>(aData->platformData);
> +  }
> +
> +  // Skip our own stack walking frames.
> +  int skip = (aData->walkCallingThread ? 3 : 0) + aData->skipFrames;

This looks great, but I can't say I'm a fan of the code duplication here and at the end of this function.  Can you see if you can unify the two, perhaps by refactoring the stackwalking itself into a StackWalkStep function that does the right thing depending on the architecture?
Attachment #8545796 - Flags: feedback?(ehsan) → feedback+
Comment on attachment 8545796 [details] [diff] [review]
(part 1) - Get NS_StackWalk() working on Win64

Review of attachment 8545796 [details] [diff] [review]:
-----------------------------------------------------------------

This looks good to me, but I do think you might be missing out on leaf frames. Please see my comment.

::: xpcom/base/nsStackWalk.cpp
@@ +463,5 @@
> +    PRUNTIME_FUNCTION runtimeFunction =
> +      RtlLookupFunctionEntry(context.Rip, &imageBase, NULL);
> +
> +    if (!runtimeFunction) {
> +      break;

I think that you might lose a frame by breaking here. Based on what I've read about RtlLookupFunctionEntry, unwind data is optional for leaf functions that do not manipulate the stack. When RtlLookupFunctionEntry returns nullptr, the code that I've seen is able to adjust the context's Rip and Rsp appropriately and then use those values for the leaf frame.

See also:
http://www.nynaeve.net/?p=101
http://www.nynaeve.net/Code/StackWalk64.cpp
Attachment #8545796 - Flags: review?(aklotz) → feedback+
> I think that you might lose a frame by breaking here. Based on what I've
> read about RtlLookupFunctionEntry, unwind data is optional for leaf
> functions that do not manipulate the stack. When RtlLookupFunctionEntry
> returns nullptr, the code that I've seen is able to adjust the context's Rip
> and Rsp appropriately and then use those values for the leaf frame.
> 
> See also:
> http://www.nynaeve.net/?p=101
> http://www.nynaeve.net/Code/StackWalk64.cpp

I was doing that originally but it didn't seem to help. There were a small fraction of stack traces where !runtimeFunction was true, and in those cases we'd have this weird pattern:

  #1: normal
  #2: normal
  #3: normal
  #4: no runtimeFunction
  #5: no runtimeFunction
  #6: no runtimeFunction

That doesn't match my understanding of what a leaf function is... but I can add it back in.
(In reply to Nicholas Nethercote [:njn] from comment #18)
> > I think that you might lose a frame by breaking here. Based on what I've
> > read about RtlLookupFunctionEntry, unwind data is optional for leaf
> > functions that do not manipulate the stack. When RtlLookupFunctionEntry
> > returns nullptr, the code that I've seen is able to adjust the context's Rip
> > and Rsp appropriately and then use those values for the leaf frame.
> > 
> > See also:
> > http://www.nynaeve.net/?p=101
> > http://www.nynaeve.net/Code/StackWalk64.cpp
> 
> I was doing that originally but it didn't seem to help. There were a small
> fraction of stack traces where !runtimeFunction was true, and in those cases
> we'd have this weird pattern:
> 
>   #1: normal
>   #2: normal
>   #3: normal
>   #4: no runtimeFunction
>   #5: no runtimeFunction
>   #6: no runtimeFunction
> 
> That doesn't match my understanding of what a leaf function is... but I can
> add it back in.

You're right, that looks really odd! What does dmajor think?
Flags: needinfo?(dmajor)
I'm pretty sure (though I can't find the MSDN page right now) that "leaf function" is a misnomer and what it should really be is "a function that doesn't need to save registers on the stack".
On IRC our discussion resolved to remembering that we're probably dealing with JIT frames in comment 18.
Flags: needinfo?(dmajor)
Comment on attachment 8545796 [details] [diff] [review]
(part 1) - Get NS_StackWalk() working on Win64

Review of attachment 8545796 [details] [diff] [review]:
-----------------------------------------------------------------

Based on our discussion on IRC, we decided that this current implementation is good enough to land; we're not going to be worrying about leaf functions in the NS_StackWalk case anyway, and as far as we know we can't do much about our JITs ATM.
Attachment #8545796 - Flags: feedback+ → review+
Comment on attachment 8545796 [details] [diff] [review]
(part 1) - Get NS_StackWalk() working on Win64

Review of attachment 8545796 [details] [diff] [review]:
-----------------------------------------------------------------

::: xpcom/base/nsStackWalk.cpp
@@ +363,5 @@
>    frame64.AddrPC.Offset    = context.StIIP;
>    frame64.AddrStack.Offset = context.SP;
>    frame64.AddrFrame.Offset = context.RsBSP;
>  #else
>  #error "Should not have compiled this code"

This will never be reached now -- maybe move it up a level?

@@ +454,5 @@
> +    context = *static_cast<CONTEXT*>(aData->platformData);
> +  }
> +
> +  // Skip our own stack walking frames.
> +  int skip = (aData->walkCallingThread ? 3 : 0) + aData->skipFrames;

Ditto Ehsan's comment, especially with that '3' -- a magic number repeated twice is scary.
Attachment #8545796 - Flags: feedback?(dmajor) → feedback+
Attachment #8545797 - Flags: review?(mh+mozilla) → review+
Comment on attachment 8546230 [details] [diff] [review]
(part 0) - Clean up the windows implementation of NS_StackWalk a little

Review of attachment 8546230 [details] [diff] [review]:
-----------------------------------------------------------------

::: xpcom/base/nsStackWalk.cpp
@@ +419,5 @@
>    }
>    return;
>  }
>  
> +static unsigned int WINAPI

I think you only need static on the declaration above.
Attachment #8546230 - Flags: review?(ehsan) → review+
(In reply to Nathan Froyd [:froydnj] [:nfroyd] from comment #20)
> I'm pretty sure (though I can't find the MSDN page right now) that "leaf
> function" is a misnomer and what it should really be is "a function that
> doesn't need to save registers on the stack".

According to http://www.nynaeve.net/?p=101 leaf functions are "functions that both make no direct modifications to the stack pointer (or any nonvolatile registers), and do not call any subfunctions." Still not really relevant for us, I think :)
Here's an updated version that merges the two functions. A few #ifdefs is
probably better than the code duplication.
Attachment #8545796 - Attachment is obsolete: true
Comment on attachment 8546262 [details] [diff] [review]
(part 1) - Get NS_StackWalk working on Win64

Review of attachment 8546262 [details] [diff] [review]:
-----------------------------------------------------------------

Carrying over aklotz's r+, since this just moves things around without changing the functionality at all.
Attachment #8546262 - Flags: review+
https://hg.mozilla.org/mozilla-central/rev/14bb2f5eed92
https://hg.mozilla.org/mozilla-central/rev/01a2e2efa168
https://hg.mozilla.org/mozilla-central/rev/3eb9a4e2bab0
Status: ASSIGNED → RESOLVED
Closed: 5 years ago
Flags: in-testsuite+
Resolution: --- → FIXED
Target Milestone: --- → mozilla37
You need to log in before you can comment on or make changes to this bug.