Closed Bug 948621 Opened 6 years ago Closed 6 years ago

Crash in DMD on Windows opt builds

Categories

(Core :: DMD, defect)

x86_64
Windows 7
defect
Not set

Tracking

()

RESOLVED FIXED
mozilla29

People

(Reporter: iacobcatalin, Assigned: njn)

References

Details

(Whiteboard: [MemShrink])

Attachments

(2 files, 2 obsolete files)

My Windows build with --enable-dmd and --enable-release (the build only works with the patch in bug 947117 applied) crashes with this stacktrace:

 	dmd.dll!mozilla::dmd::LocationService::WriteLocation(const mozilla::dmd::Writer & aWriter={...}, const void * aPc=0xffffffff) Line 714	C++
 	dmd.dll!mozilla::dmd::StackTrace::Print(const mozilla::dmd::Writer & aWriter={...}, mozilla::dmd::LocationService * aLocService=0x155d5000) Line 857	C++
 	dmd.dll!mozilla::dmd::TraceRecord::Print(const mozilla::dmd::Writer & aWriter={...}, mozilla::dmd::LocationService * aLocService=0x155d5000, unsigned int aM=132, unsigned int aN=1233, const char * aStr=0x567db830, const char * astr=0x567db824, unsigned int aCategoryUsableSize=13784707, unsigned int aCumulativeUsableSize=8500236, unsigned int aTotalUsableSize=55711912) Line 1522	C++
 	dmd.dll!mozilla::dmd::PrintSortedRecords<mozilla::dmd::TraceRecord>(const mozilla::dmd::Writer & aWriter={...}, mozilla::dmd::LocationService * aLocService=0x155d5000, const char * aStr=0x567db830, const char * astr=0x567db824, const js::HashSet<mozilla::dmd::TraceRecord,mozilla::dmd::TraceRecord,mozilla::dmd::InfallibleAllocPolicy> & aRecordTable={...}, unsigned int aCategoryUsableSize=13784707, unsigned int aTotalUsableSize=55711912) Line 1959	C++
 	dmd.dll!mozilla::dmd::PrintSortedTraceAndFrameRecords(const mozilla::dmd::Writer & aWriter={...}, mozilla::dmd::LocationService * aLocService=0x155d5000, const char * aStr=0x567db830, const char * astr=0x567db824, const js::HashSet<mozilla::dmd::TraceRecord,mozilla::dmd::TraceRecord,mozilla::dmd::InfallibleAllocPolicy> & aTraceRecordTable={...}, unsigned int aCategoryUsableSize=13784707, unsigned int aTotalUsableSize=55711912) Line 1980	C++
 	dmd.dll!mozilla::dmd::Dump(mozilla::dmd::Writer aWriter={...}) Line 2169	C++
 	xul.dll!mozilla::dmd::ReportAndDump(JSContext * cx=0x05130c90, unsigned int argc=1, JS::Value * vp=0x0515b3c0) Line 1692	C++
 	mozjs.dll!js::Invoke(JSContext * cx=0x05130c90, JS::CallArgs args={...}, js::MaybeConstruct construct=NO_CONSTRUCT) Line 456	C++
 	mozjs.dll!Interpret(JSContext * cx=0x0515b3d0, js::RunState & state={...}) Line 2505	C++
 	mozjs.dll!js::RunScript(JSContext * cx=0x05130c90, js::RunState & state={...}) Line 420	C++
 	mozjs.dll!js::ExecuteKernel(JSContext * cx=0x05130c90, JS::Handle<JSScript *> script={...}, JSObject & scopeChainArg={...}, const JS::Value & thisv={...}, js::ExecuteType type=EXECUTE_DEBUG_GLOBAL, js::AbstractFramePtr evalInFrame={...}, JS::Value * result=0x002ccfe0) Line 611	C++
 	mozjs.dll!js::EvaluateInEnv(JSContext * cx=0x05130c90, JS::Handle<JSObject *> env={...}, JS::Handle<JS::Value> thisv={...}, js::AbstractFramePtr frame={...}, JS::StableCharPtr chars={...}, unsigned int length=25, const char * filename=0x555acd70, unsigned int lineno=1, JS::MutableHandle<JS::Value> rval={...}) Line 4351	C++
 	mozjs.dll!DebuggerGenericEval(JSContext * cx=0x05130c90, const char * fullMethodName=0x555acf68, const JS::Value & code={...}, EvalBindings evalWithBindings=EvalHasExtraBindings, JS::Handle<JS::Value> bindings={...}, JS::Handle<JS::Value> options={...}, JS::MutableHandle<JS::Value> vp={...}, js::Debugger * dbg=0x1370f900, JS::Handle<JSObject *> scope={...}, js::ScriptFrameIter * iter=0x00000000) Line 4480	C++
 	mozjs.dll!DebuggerObject_evalInGlobalWithBindings(JSContext * cx=0x05130c90, unsigned int argc=85308240, JS::Value * vp=0x0515b330) Line 5312	C++
 	mozjs.dll!js::Invoke(JSContext * cx=0x05130c90, JS::CallArgs args={...}, js::MaybeConstruct construct=NO_CONSTRUCT) Line 456	C++
 	mozjs.dll!Interpret(JSContext * cx=0x0515b340, js::RunState & state={...}) Line 2505	C++
 	mozjs.dll!js::RunScript(JSContext * cx=0x05130c90, js::RunState & state={...}) Line 420	C++
 	mozjs.dll!js::Invoke(JSContext * cx=0x05130c90, JS::CallArgs args={...}, js::MaybeConstruct construct=NO_CONSTRUCT) Line 482	C++
 	mozjs.dll!js::CallOrConstructBoundFunction(JSContext * cx=0x05130c90, unsigned int argc=2, JS::Value * vp=0x0515b168) Line 1274	C++
 	mozjs.dll!js::Invoke(JSContext * cx=0x05130c90, JS::CallArgs args={...}, js::MaybeConstruct construct=NO_CONSTRUCT) Line 456	C++
 	mozjs.dll!Interpret(JSContext * cx=0x0515b178, js::RunState & state={...}) Line 2505	C++
 	mozjs.dll!js::RunScript(JSContext * cx=0x05130c90, js::RunState & state={...}) Line 420	C++
 	mozjs.dll!js::Invoke(JSContext * cx=0x05130c90, JS::CallArgs args={...}, js::MaybeConstruct construct=NO_CONSTRUCT) Line 482	C++
 	mozjs.dll!js_fun_apply(JSContext * cx=0x05130c90, unsigned int argc=2, JS::Value * vp=0x0515b058) Line 1046	C++
 	mozjs.dll!js::Invoke(JSContext * cx=0x05130c90, JS::CallArgs args={...}, js::MaybeConstruct construct=NO_CONSTRUCT) Line 456	C++
 	mozjs.dll!Interpret(JSContext * cx=0x0515b068, js::RunState & state={...}) Line 2505	C++
 	mozjs.dll!js::RunScript(JSContext * cx=0x05130c90, js::RunState & state={...}) Line 420	C++
 	mozjs.dll!js::Invoke(JSContext * cx=0x05130c90, JS::CallArgs args={...}, js::MaybeConstruct construct=NO_CONSTRUCT) Line 482	C++
 	mozjs.dll!js::Invoke(JSContext * cx=0x05130c90, const JS::Value & thisv={...}, const JS::Value & fval={...}, unsigned int argc=0, JS::Value * argv=0x002cef54, JS::MutableHandle<JS::Value> rval={...}) Line 513	C++
 	mozjs.dll!JS_CallFunctionValue(JSContext * cx=0x05130c90, JSObject * objArg=0x13b09320, JS::Value fval={...}, unsigned int argc=0, JS::Value * argv=0x002cef54, JS::Value * rval=0x002cee78) Line 4990	C++
 	xul.dll!nsXPCWrappedJSClass::CallMethod(nsXPCWrappedJS * wrapper=0x137530c0, unsigned short methodIndex=3, const XPTMethodDescriptor * info_=0x01015e90, nsXPTCMiniVariant * nativeParams=0x002cf0d0) Line 1414	C++
 	xul.dll!nsXPCWrappedJS::CallMethod(unsigned short methodIndex=3, const XPTMethodDescriptor * info=0x01015e90, nsXPTCMiniVariant * params=0x002cf0d0) Line 484	C++
 	xul.dll!PrepareAndDispatch(nsXPTCStubBase * self=0x15323520, unsigned int methodIndex=3, unsigned int * args=0x002cf17c, unsigned int * stackBytesToPop=0x002cf16c) Line 85	C++
 	xul.dll!SharedStub() Line 113	C++
 	xul.dll!nsThread::ProcessNextEvent(bool mayWait=false, bool * result=0x002cf219) Line 612	C++
 	xul.dll!NS_ProcessNextEvent(nsIThread * thread=0x00ada220, bool mayWait=false) Line 263	C++
 	xul.dll!mozilla::ipc::MessagePump::Run(base::MessagePump::Delegate * aDelegate=0x00a4a0c0) Line 85	C++
 	xul.dll!MessageLoop::RunInternal() Line 223	C++
 	xul.dll!MessageLoop::RunHandler() Line 216	C++
 	xul.dll!MessageLoop::Run() Line 190	C++
 	xul.dll!nsBaseAppShell::Run() Line 163	C++
 	xul.dll!nsAppShell::Run() Line 113	C++
 	xul.dll!nsAppStartup::Run() Line 268	C++
 	xul.dll!XREMain::XRE_mainRun() Line 3978	C++
 	xul.dll!XREMain::XRE_main(int argc=3, char * * argv=0x0036e1c8, const nsXREAppData * aAppData=0x002cf754) Line 4046	C++
 	xul.dll!XRE_main(int argc=3, char * * argv=0x0036e1c8, const nsXREAppData * aAppData=0x002cf754, unsigned int aFlags=0) Line 4254	C++
 	firefox.exe!do_main(int argc=3, char * * argv=0x0036e1c8, nsIFile * xreDirectory=0x00ada0a0) Line 280	C++
 	firefox.exe!NS_internal_main(int argc=3, char * * argv=0x0036e1c8) Line 648	C++
 	firefox.exe!wmain(int argc=3, wchar_t * * argv=0x0036a7e0) Line 105	C++
 	firefox.exe!__tmainCRTStartup() Line 533	C
 	kernel32.dll!@BaseThreadInitThunk@12()	Unknown
 	ntdll.dll!___RtlUserThreadStart@8()	Unknown
 	ntdll.dll!__RtlUserThreadStart@8()	Unknown
I diagnosed the crash in the debugger and have a quite clear idea why it happens.

entry.mLibrary is NULL so entry.mLibrary[0] crashes. The contract for entry.mLibrary is that it should never be NULL since it gets filled a few lines up by:
entry.Replace(aPc, details.function, library, details.loffset, details.filename, details.lineno);
and library is not NULL when calling Replace.

However, the Replace line only executes if (entry.mPc != aPc). What happens when it crashes is that entry.mPc == Entry::kUnused == 0xffffffff (normal, indicates an unused entry) and aPc, that comes from nsStackWalk is also 0xffffffff. The two are equal (in debug that would trigger the MOZ_ASSERT but in release builds it doesn't), Replace is never executed so entry.mLibrary remains NULL.

The question is why does nsStackWalk push 0xffffffff as the address where code is executed from. The 0xffffffff pushed to DMD ultimately comes from frame64.AddrPC.Offset in WalkStackMain64 in nsStackWalk.cpp. 

This line: frame64.AddrPC.Offset    = context.Eip; never produces 0xffffffff which makes sense since it's an impossible value for Eip. But a bit later in WalkStackMain frame64 is passed into StackWalk64 and I have verified in the debugger that there are cases when StackWalk64 changes frame64.AddrPC.Offset from a reasonable value to 0xffffffffffffffff which then gets truncated to 0xffffffff when doing data->pcs[data->pc_count] = (void*)addr;

For example, in one case where frame64.AddrPC.Offset comes out of StackWalk64 as 0xffffffffffffffff its input value is 0x52A32CDA which I verified is somewhere inside xul.dll!mozilla::crashreporter::LSPAnnotationGatherer::Run(). And the thread whose stack is being walked shows the following stacktrace in the debugger so everything seems ok yet somehow, StackWalk64 isn't able to walk from mozilla::crashreporter::LSPAnnotationGatherer::Run() to its nsThread::ProcessNextEvent caller

 	ntdll.dll!_NtSignalAndWaitForSingleObject@16()	Unknown
 	ntdll.dll!_NtSignalAndWaitForSingleObject@16()	Unknown
 	user32.dll!_NtUserPostThreadMessage@16()	Unknown
 	dmd.dll!NS_StackWalk(void (void *, void *, void *) * aCallback=0x54dd6a70, unsigned int aSkipFrames=0x00000002, unsigned int aMaxFrames=0x00000018, void * aClosure=0x06d3f998, unsigned int aThread=0x00000000, void * aPlatformData=0x00000000) Line 620	C++
 	dmd.dll!mozilla::dmd::StackTrace::Get(mozilla::dmd::Thread * aT=0x00a021bc) Line 879	C++
 	dmd.dll!mozilla::dmd::AllocCallback(void * aPtr=0x054b3000, unsigned int aReqSize=0x00001afc, mozilla::dmd::Thread * aT=0x00a021bc) Line 1184	C++
 	dmd.dll!replace_malloc(unsigned int aSize=0x00001afc) Line 1245	C++
 	mozglue.dll!malloc_impl(unsigned int size=0x00001afc) Line 151	C
 	mozalloc.dll!moz_xmalloc(unsigned int size=0x00001afc) Line 52	C++
 	xul.dll!operator new(unsigned int size=0x00001afc) Line 201	C++
>	xul.dll!mozilla::crashreporter::LSPAnnotationGatherer::Run() Line 66	C++
 	xul.dll!nsThread::ProcessNextEvent(bool mayWait=true, bool * result=0x06d3fdbd) Line 612	C++
 	xul.dll!NS_ProcessNextEvent(nsIThread * thread=0x00ada2e0, bool mayWait=true) Line 263	C++
 	xul.dll!nsThread::ThreadFunc(void * arg=0x00ada2e0) Line 246	C++
 	nss3.dll!_PR_NativeRunThread(void * arg=0x054ad4c0) Line 419	C
 	nss3.dll!pr_root(void * arg=0x00ad23a0) Line 90	C
 	msvcr110.dll!_callthreadstartex() Line 354	C
 	msvcr110.dll!_threadstartex(void * ptd=0x057781d8) Line 332	C
 	kernel32.dll!@BaseThreadInitThunk@12()	Unknown
 	ntdll.dll!___RtlUserThreadStart@8()	Unknown
 	ntdll.dll!__RtlUserThreadStart@8()	Unknown


I'm using the system dbghelp.dll which is version 6.1.7601.17514.

I don't really know what to do with this: either we're using StackWalk64 wrong somehow and it returns the unfortunate 0xffffffffffffffff because of that wrong usage or we simply need to accept that StackWalk64 will not always give meaningful results in which case DMD needs to remove the assumption that 0xffffffff is special and can never be produced by the stack walking code.

Nick, do you know about Windows experts that might help with StackWalk64?
Great analysis, Catalin.

> Nick, do you know about Windows experts that might help with StackWalk64?

No.  All that stack-walking code is pretty hairy...

> DMD needs to remove the
> assumption that 0xffffffff is special and can never be produced by the stack
> walking code.

...so that seems fine to me, and the easiest way forward.  You should be able to steal a bit from |mLineNo| in order to create a new |mIsUsed| field.

But now I'm wondering why this assertion doesn't fire in debug builds:

    MOZ_ASSERT(aPc != Entry::kUnused);

Do the 0xffffffff entries only occur in opt builds?
Do you have --enable-profiling in your .mozconfig?  We are unable to reliably walk the stacks on Windows optimized builds unless that option is set.  We should hide any code which expects this to work behind #ifdef MOZ_PROFILING.
Whiteboard: [MemShrink]
The patch is straightforward.  Catalin, I used to ask jlebar for DMD reviews.
But he's gone now, which means that you know this code as well as anyone other
than me :)
Attachment #8346979 - Flags: review?(iacobcatalin)
Assignee: nobody → n.nethercote
Status: NEW → ASSIGNED
BTW, we'll have to address the MOZ_PROFILING issue in comment 3, but my patch is virtuous anyway -- it's better to not have any distinguished PC values that could collide with real PC values.
This is one way of doing this.  But it might be better (though harder) to do it
within configure.in.
Attachment #8346983 - Flags: review?(ehsan)
Comment on attachment 8346983 [details] [diff] [review]
(part 2) - Abort DMD-enabled Windows builds unless --enable-profiling is specified.

Review of attachment 8346983 [details] [diff] [review]:
-----------------------------------------------------------------

Sorry for the nit-picking here, but this will save somebody someday!

::: memory/replace/dmd/DMD.cpp
@@ +15,5 @@
>  #include <string.h>
>  
>  #ifdef XP_WIN
> +#if !defined(DEBUG) && !defined(MOZ_PROFILING)
> +#error "Non-debug DMD-enabled builds on Windows must be built with --enable-profiling"

This is a bit imprecise.  The thing that we're really looking for here is to see whether the build is not-optimized or whether profiling is turned on (see this logic: <http://mxr.mozilla.org/mozilla-central/source/build/autoconf/frameptr.m4#31>).  We don't have a preprocessor flag which is turned on for non-optimized builds, but it's easy enough.  Please add this in the moz.build:

if CONFIG['MOZ_OPTIMIZE']:
    DEFINES['MOZ_OPTIMIZE'] = True

And then, you can test for:

#if defined(MOZ_OPTIMIZE) && !defined(MOZ_PROFILING)
Attachment #8346983 - Flags: review?(ehsan) → review-
Comment on attachment 8346979 [details] [diff] [review]
DMD: Handle arbitrary PC values in stack frames.

Review of attachment 8346979 [details] [diff] [review]:
-----------------------------------------------------------------

::: memory/replace/dmd/DMD.cpp
@@ +606,5 @@
>                              //   in a non-empty entry is in use
>      ptrdiff_t   mLOffset;
>      char*       mFileName;  // owned by the Entry; may be null
> +    uint32_t    mLineNo:31;
> +    uint32_t    mInUse;     // is the entry used?

Did you forget a :1 here?
This version has the missing ":1" that Ehsan spotted.
Attachment #8346998 - Flags: review?(iacobcatalin)
Attachment #8346979 - Attachment is obsolete: true
Attachment #8346979 - Flags: review?(iacobcatalin)
Attachment #8346983 - Attachment is obsolete: true
Comment on attachment 8347001 [details] [diff] [review]
(part 2) - Abort DMD-enabled Windows builds unless --enable-profiling is specified.

Review of attachment 8347001 [details] [diff] [review]:
-----------------------------------------------------------------

::: memory/replace/dmd/DMD.cpp
@@ +15,5 @@
>  #include <string.h>
>  
>  #ifdef XP_WIN
> +#if defined(MOZ_OPTIMIZE) && !defined(MOZ_PROFILING)
> +#error "Non-debug DMD-enabled builds on Windows must be built with --enable-profiling"

Nit: "Non-debug" should be "optimized".
Attachment #8347001 - Flags: review?(ehsan) → review+
(In reply to :Ehsan Akhgari (needinfo? me!) from comment #3)
> Do you have --enable-profiling in your .mozconfig?  We are unable to
> reliably walk the stacks on Windows optimized builds unless that option is
> set.

I didn't have it. And indeed it helps with walking the stack. With Nick's patch to fix DMD if I build with --enable-release and without --enable-profiling the crash is fixed (as expected) but the stacktraces don't look too well (a lot of them are very short for example). Adding --enable-profiling improves the stacktraces a lot.

So with both patches applied this bug is fixed.
Comment on attachment 8346998 [details] [diff] [review]
DMD: Handle arbitrary PC values in stack frames.

Review of attachment 8346998 [details] [diff] [review]:
-----------------------------------------------------------------

Looks good and fixes the crash. With this patch, --enable-dmd, --enable-release and --enable-profiling I get good stacktraces.

I don't know this patch if it's really really needed (maybe just --enable-profiling would make the stack walking code not produce the problematic 0xffffffff) but it doesn't matter since the code looks better and is less fragile with the patch.
Attachment #8346998 - Flags: review?(iacobcatalin) → review+
Thanks, Catalin.  Does the --enable-release flag have any effect here?  Is it necessary, or are you just mentioning it for completeness?
https://hg.mozilla.org/mozilla-central/rev/8a254244f3db
https://hg.mozilla.org/mozilla-central/rev/4afced469f9d
Status: ASSIGNED → RESOLVED
Closed: 6 years ago
Resolution: --- → FIXED
Target Milestone: --- → mozilla29
(In reply to Nicholas Nethercote [:njn] from comment #14)
> Thanks, Catalin.  Does the --enable-release flag have any effect here?  Is
> it necessary, or are you just mentioning it for completeness?

I mentioned --enable-release because that's what I used to reproduce this bug and get an optimized build. 

--enable-release together with --enable-dmd produced the linking error you saw in https://bugzilla.mozilla.org/show_bug.cgi?id=936784#c40 and which lead to this bug. I took the --enable-release from the try build at https://tbpl.mozilla.org/php/getParsedLog.php?id=31460041&full=1&branch=try . This is the one that gave you the linking error.

In the mean time I found out there's an --enable-optimized which I suppose is implied by --enable-release (since the official Mozilla nigthlies have --enable-release but don't have --enable-optimized in about:buildconfig). At this point it's not clear to me what exactly do people refer to when saying "optimized build" so I mentioned --enable-release to be unambiguous.
FWIW I filed bug 950856 to make sure that NS_StackWalk will only be available where we really support it on Windows.
> In the mean time I found out there's an --enable-optimized which I suppose
> is implied by --enable-release (since the official Mozilla nigthlies have
> --enable-release but don't have --enable-optimized in about:buildconfig). At
> this point it's not clear to me what exactly do people refer to when saying
> "optimized build" so I mentioned --enable-release to be unambiguous.

Ah.  "Optimized build" usually means that --enable-optimize was specified (or implied) and --enable-debug was not specified.  "Debug build" means --enable-debug was specified (and --enable-optimize may or may not be specified or implied).

I updated the Windows build instructions at https://wiki.mozilla.org/Performance/MemShrink/DMD#Everything_other_than_B2G-device_builds, so hopefully they are correct, i.e. --enable-dmd and --enable-profiling are all that is necessary (and the user can choose --enable-optimize and/or --enable-debug as they wish).  Ehsan, is that right?
Flags: needinfo?(ehsan)
(In reply to Nicholas Nethercote [:njn] from comment #20)
> > In the mean time I found out there's an --enable-optimized which I suppose
> > is implied by --enable-release (since the official Mozilla nigthlies have
> > --enable-release but don't have --enable-optimized in about:buildconfig). At
> > this point it's not clear to me what exactly do people refer to when saying
> > "optimized build" so I mentioned --enable-release to be unambiguous.
> 
> Ah.  "Optimized build" usually means that --enable-optimize was specified
> (or implied) and --enable-debug was not specified.  "Debug build" means
> --enable-debug was specified (and --enable-optimize may or may not be
> specified or implied).

Note that is only a common definition.  There is nothing which makes --enable-debug and --enable-optimize incompatible (in fact, the "debug" builds on TBPL have both set, which means that they have additional debug only checks at runtime, and they also use compiler optimizations.)  The key point here is that we instruct the compiler to not emit frame pointers if --enable-optimize is used without --enable-profiling.  Sorry if this all is a bit contrived.  :-)

> I updated the Windows build instructions at
> https://wiki.mozilla.org/Performance/MemShrink/DMD#Everything_other_than_B2G-
> device_builds, so hopefully they are correct, i.e. --enable-dmd and
> --enable-profiling are all that is necessary (and the user can choose
> --enable-optimize and/or --enable-debug as they wish).  Ehsan, is that right?

Yes, that looks good.
Flags: needinfo?(ehsan)
You need to log in before you can comment on or make changes to this bug.