jits totally break SEH on Win64, including SetUnhandledExceptionFilter

RESOLVED FIXED in mozilla35

Status

()

defect
RESOLVED FIXED
6 years ago
5 years ago

People

(Reporter: luke, Unassigned)

Tracking

unspecified
mozilla35
x86_64
Windows 7
Points:
---

Firefox Tracking Flags

(Not tracked)

Details

(Whiteboard: [js:t])

Attachments

(6 attachments, 5 obsolete attachments)

(Reporter)

Description

6 years ago
I discovered, while adding Win64 signal handler support in Odin, that Win64 makes some pretty incredible demands if you want SEH to work at all [1]:

  "For dynamically generated functions [JIT compilers], the runtime to support these functions must either use RtlInstallFunctionTableCallback or RtlAddFunctionTable to provide this information to the operating system."

The JIT totally doesn't do this.  Now, you might expect this to break __try/__except/__finally, but it also breaks the "unhandled exception filter", which we do use, mostly in breakpad, but also that FP exception fix gal made a few years ago [2].  (It looks like SPS uses SuspendThread/GetThreadContext which is great.)  What I don't know is if breakpad has some fallback that still spits out a crash report if the unhandled exception filter isn't called.

Fortunately, Win64 is only shipped as a nightly build; it's not a tier1 platform, so this isn't a real priority.

Btw, an alternative to SEH is the (new with WinXP!) "vectored exception handlers" which don't do any of this stack funny business and thus works fine with jits; they're like signal handlers that automatically chain [3].

[1] http://msdn.microsoft.com/en-us/library/ft9x1kdx.aspx
[2] http://mxr.mozilla.org/mozilla-central/search?string=SetUnhandledExceptionFilter
[3] http://msdn.microsoft.com/en-us/library/windows/desktop/ms681420%28v=vs.85%29.aspx
Breakpad doesn't have any fallbacks if SetUnhandledExceptionFilter doesn't work on Windows. Sounds like we could plausibly use vectored exception handlers and make everything work.
(Reporter)

Comment 2

6 years ago
Indeed.  The only potential problem is that the vectored exception handlers run before all normal SEH handlers which means that breakpad might crash before some SEH handler got a chance.  OTOH, given that SEH doesn't work at all because of jits, we should probably have some FF-wide rule "no SEH": only vectored exception handlers.
OTOH, if you did support this, then adding support for unwinding from JIT frames on Linux (probably Mac, too?) and GDB support (http://sourceware.org/gdb/current/onlinedocs/gdb/JIT-Interface.html#JIT-Interface) would be fairly straightforward too.

But I agree this is not a major priority.
Whiteboard: [js:t]

Comment 4

6 years ago
Then disable the JIT until it is fixed.
How about a vectored continue handler? From some brief debugging, it seems that they work like vectored exception handlers except they get called *after* the SEH walk, even if JIT code sends the lookup astray.
Blocks: 1024272
This lets Breakpad run and makes bug 1024272 pass, but annoyingly, it still pops up my default postmortem debugger afterwards. I don't know whether that's an inherent part of the design or if it's something we can control.

Breakpad forwards to the FPEFilter, so that ought to have a say (would need need to test to verify). Are there any other such filters that we need to worry about?
Attachment #8452827 - Flags: feedback?(luke)
Attachment #8452827 - Attachment is patch: true

Comment 7

5 years ago
Given just your description in the bug, don't we want a VectoredExceptionHandler and *not* a VectoredContinueHandler? In particular 3rd-party code might be using SEH handlers correctly, and we don't want to intercept that if there is a valid __except block.
(Reporter)

Comment 8

5 years ago
One problem is that vectored continue handlers appear to be newer than WinXP (though MSDN shows they are in 64-bit XP); vectored exception handlers were in WinXP (and are used by asm.js/Ion without any reported issues).

Because JITs make SEH inherently unreliable on Win64, I think it would be safer to just say you can't use SEH.  That is, it'd be lame if your __except block only gets called depending on whether there is JIT code on the stack.  Do you know if FF has any important uses of it?  mxr shows some hits, but many are in breakpad/sandboxing, so I don't know if they're used.

Comment 9

5 years ago
We have one very specific case in libjar where we're reading memory-mapped files and use SEH to catch I/O errors. We also use SEH to catch exceptions thrown by 3rd-party code which otherwise might get eaten by wow64.

I know that some 3rd-party code uses SEH for their own purposes, but it shouldn't escape past their boundaries.

But no, I don't think there is ever a case where SEH would pass through JIT code and we would expect to recover.

And for the purposes of this bug, we only care about win64 on vista+, and maybe only win7+. XP64 is not a target.
(Reporter)

Comment 10

5 years ago
(In reply to Benjamin Smedberg  [:bsmedberg] from comment #9)
> But no, I don't think there is ever a case where SEH would pass through JIT
> code and we would expect to recover.

Well that's encouraging.  One question I had is whether JIT code above the __try (stack growing down) would mess up SEH, but that'd be crazy.

It'd be great if we could assert that JIT code never executed inside an SEH block.  It looks like we already have a MOZ_SEH_TRY.  This could push an RAII guard that sets some thread-local flag that we could then test in AutoJSAPI (which should be a pinch-point for all JSAPI usage in Gecko).

> And for the purposes of this bug, we only care about win64 on vista+, and
> maybe only win7+. XP64 is not a target.

w00t
(Reporter)

Updated

5 years ago
Attachment #8452827 - Flags: feedback?(luke) → feedback+
This is the last remaining test failure for win64 so it would be great to get this resolved, but I really need to focus on other work right now.

If anyone wants to pick it up in the meantime, here's what I think needs to be done:

* Add an entry for VectoredContinueHandler to |enum HandlerType| in exception_handler.h
* In ExceptionHandler::Initialize, register a VectoredContinueHandler that basically does the same thing as ExceptionHandler::HandleException, except that we don't need to defer to previous_filter_ (because the OS will defer to the other vectored handlers automatically)
* Do this only on 64-bit since the API is not available on 32-bit WinXP
* Clean up in ExceptionHandler::~ExceptionHandler
* The RAII enforcements from comment 10

(FWIW, XP64 is actually built from the Server 2003 SP1 codebase)
Assignee: general → nobody
I have a patch implementing most of David's suggestions from comment 11 (I think) that compiles just fine on try:

https://tbpl.mozilla.org/?tree=Try&rev=f501fb0f4643

But it doesn't run any tests, and thus is relatively useless for telling me if I did things correctly.  David, can you instruct me on how to run Win64 tests?

I'm sure a Windows expert can tell me nine different ways my patch is wrong, so feel free to have at it!
Flags: needinfo?(dmajor)
Awesome! I owe you a beverage.

Feel free to use the Date branch as your Try by pushing this and backing out. Only the "Windows 8 Opt" and "Windows 8 Debug" lines matter for purposes of this bug -- beware that they show both x86 and x64 test results on the same line.

When you're ready you'll want to work with Ted to get this into upstream Breakpad.
Flags: needinfo?(dmajor)
    1.58 +LONG ExceptionHandler::VectoredContinueHandleException(EXCEPTION_POINTERS* exinfo) {
    1.59 +  AutoExceptionHandler auto_exception_handler;

I'm not sure whether the AutoExceptionHandler is needed here. Probably not doing much harm though.

    1.33      if (handler_types_ & HANDLER_EXCEPTION)
    1.34 +#ifdef _WIN64
    1.35 +      if (handler_types_ & HANDLER_VECTORED_CONTINUE)
    1.36 +	RemoveVectoredContinueHandler(VectoredContinueHandleException);
    1.37 +#endif // _WIN64
    1.38 +
    1.39        SetUnhandledExceptionFilter(previous_filter_);

Yikes!
Well, that patch certainly made things a lot worse:

https://tbpl.mozilla.org/?tree=Date&rev=a6f0d34c3de4

Looks like we're crashing while dealing with JIT code in numerous different ways.

I'm not a JIT expert, but ISTR that the JIT catches page faults in JIT code as a way of patching things up if need be.  And I'd also guess that the new handler we're installing catches those page faults, since it gets to run first, rather than before whatever the JIT is doing.  So maybe we have to let the JIT have the first crack at this with its own vectored continue handler?

Luke, does any of that sound plausible?
Flags: needinfo?(luke)
(Reporter)

Comment 16

5 years ago
Assuming vectored continue handlers happen after SEH and vectored exception handlers happen before SEH, it seems like we should be fine.  Looking at your try results, I can see tests/dom/asmjscache/test/test_slow.html succeeding and this test should be hitting the fault handlers.  All the crashes are in the Ion (non-asm.js) use of fault handlers.

What happens if you run one of these builds locally and execute the script 'while(true){}'?  That should hit the Ion signal handler path (make sure you have MOZ_CRASHREPORTER set).  It'd also be good to write a tiny test program that has all three handlers (exception, __finally, continue) and make sure they are executed in that order.
Flags: needinfo?(luke)
(In reply to Luke Wagner [:luke] from comment #16)
> Assuming vectored continue handlers happen after SEH and vectored exception
> handlers happen before SEH, it seems like we should be fine.

Hm, I see that comment 5 supports this line of reasoning.  But http://msdn.microsoft.com/en-us/library/windows/desktop/ms681420%28v=vs.85%29.aspx says:

"Vectored handlers are called in the order that they were added, after the debugger gets a first chance notification, but before the system begins unwinding the stack."

I can't find any documentation that discusses the difference between continue and exception handlers.  I'll keep digging, though...
(Reporter)

Comment 18

5 years ago
Yeah, I found the documentation in this area sorely lacking (esp the continue handlers).  I'd start with just the shell experiment to see if this even works at all.
Yeah there's not a lot of documentation on this stuff. VectoredContinueHandlers are a newer feature than VectoredExceptionHandlers. It's possible they bolted on to the existing documentation that was written only about VEH.

I arrived at comment 5 by trial and error plus single-stepping through the NT exception dispatcher. It ought to be: VEH, SEH, VCH, in that order. I'm really surprised that it's interfering with JS handlers.
(Reporter)

Comment 20

5 years ago
Does <script>while(true){}</script> correctly bring up the slow-script dialog (and successfully exit) in a patched build?
(In reply to Luke Wagner [:luke] from comment #20)
> Does <script>while(true){}</script> correctly bring up the slow-script
> dialog (and successfully exit) in a patched build?

No, it does not.  It crashes the browser.  An unpatched build correctly brings up the slow-script dialog.
Flags: needinfo?(luke)
(Reporter)

Comment 22

5 years ago
Oh great, so at least we have a simple test-case.  Can you put printfs in your vectored-continue handler as well as in AsmJSExceptionHandler (in js/src/asmjs/AsmJSSignalHandler.cpp; it's actually not asm.js-specific anymore) and see which one fires first?
Flags: needinfo?(luke)
I haven't done the printfs yet, I haven't worked up the courage to try and setup a win64 compilation environment on my Windows machine.  I'll try doing that tomorrow, and/or continue using Date as my own personal Win64 Try.

I did do some more research.  Nobody seems to want to talk about vectored continue handlers; everybody wants to discuss vectored exception handlers.  Not a good sign.  This blog post by someone at Microsoft is even more worrisome: the vectored continue handlers don't even get executed in some versions of Windows (!):

http://blogs.msdn.com/b/zhanli/archive/2010/06/25/c-tips-addvectoredexceptionhandler-addvectoredcontinuehandler-and-setunhandledexceptionfilter.aspx

I redid the patch, realizing EXCEPTION_CONTINUE_EXECUTION was really a stupid idea if we succeed in making a crash dump, since we didn't do anything to fixup the problem that caused the exception in the first place, and that we should always return EXCEPTION_CONTINUE_SEARCH, since the process is probably going to fall over anyway.  But that didn't make things any better:

https://tbpl.mozilla.org/?tree=Date&rev=464c8b5a1f28

All the failures look to be variations on:

17:00:56     INFO -  CPU: amd64
17:00:56     INFO -       family 6 model 30 stepping 5
17:00:56     INFO -       8 CPUs
17:00:56     INFO -  Crash reason:  EXCEPTION_ACCESS_VIOLATION_READ
17:00:56     INFO -  Crash address: 0x2c02f88
17:00:56     INFO -  Thread 0 (crashed)
17:00:56     INFO -   0  mozjs.dll!js::jit::Assembler::TraceJumpRelocations(JSTracer *,js::jit::JitCode *,js::jit::CompactBufferReader &) [Assembler-x64.cpp:464c8b5a1f28 : 260 + 0x7a]
17:00:56     INFO -      rbx = 0x000000000b208550   r12 = 0x000007fc8a89e910
17:00:56     INFO -      r13 = 0x0000000000000000   r14 = 0x00000000004bf078
17:00:56     INFO -      r15 = 0x00007fffffffffff   rip = 0x000007fc8a83ae24
17:00:56     INFO -      rsp = 0x00000000004beeb0   rbp = 0x000000000b208550
17:00:56     INFO -      Found by: given as instruction pointer in context
17:00:56     INFO -   1  mozjs.dll!js::jit::JitCode::trace(JSTracer *) [Ion.cpp:464c8b5a1f28 : 767 + 0x24]
17:00:56     INFO -      rbx = 0x000000000b208550   r12 = 0x000007fc8a89e910
17:00:56     INFO -      r13 = 0x0000000000000000   r14 = 0x00000000004bf078
17:00:56     INFO -      r15 = 0x00007fffffffffff   rip = 0x000007fc8a580ed6
17:00:56     INFO -      rsp = 0x00000000004bef10   rbp = 0x000000000b208550
17:00:56     INFO -      Found by: call frame info
17:00:56     INFO -   2  mozjs.dll!js::GCMarker::processMarkStackOther(unsigned __int64,unsigned __int64) [Marking.cpp:464c8b5a1f28 : 1647 + 0xa]
17:00:56     INFO -      rbx = 0x000000000b208550   r12 = 0x000007fc8a89e910
17:00:56     INFO -      r13 = 0x0000000000000000   r14 = 0x00000000004bf078
17:00:56     INFO -      r15 = 0x00007fffffffffff   rip = 0x000007fc8a49d5c2
17:00:56     INFO -      rsp = 0x00000000004bef50   rbp = 0x000000000b208550
17:00:56     INFO -      Found by: call frame info
17:00:56     INFO -   3  mozjs.dll!js::GCMarker::processMarkStackTop(js::SliceBudget &) [Marking.cpp:464c8b5a1f28 : 1685 + 0xa]
17:00:56     INFO -      rbx = 0x000000000b208550   r12 = 0x000007fc8a89e910
17:00:56     INFO -      r13 = 0x0000000000000000   r14 = 0x00000000004bf078
17:00:56     INFO -      r15 = 0x00007fffffffffff   rip = 0x000007fc8a49e44a
17:00:56     INFO -      rsp = 0x00000000004bef90   rbp = 0x000000000b208550
17:00:56     INFO -      Found by: call frame info
17:00:56     INFO -   4  mozjs.dll!js::GCMarker::drainMarkStack(js::SliceBudget &) [Marking.cpp:464c8b5a1f28 : 1810 + 0xa]
17:00:56     INFO -      rbx = 0x000000000b208550   r12 = 0x000007fc8a89e910
17:00:56     INFO -      r13 = 0x0000000000000000   r14 = 0x00000000004bf078
17:00:56     INFO -      r15 = 0x00007fffffffffff   rip = 0x000007fc8a49e51b
17:00:56     INFO -      rsp = 0x00000000004bf000   rbp = 0x000000000b208550
17:00:56     INFO -      Found by: call frame info
17:00:56     INFO -   5  mozjs.dll!js::gc::GCRuntime::incrementalCollectSlice(__int64,JS::gcreason::Reason) [jsgc.cpp:464c8b5a1f28 : 5336 + 0x3b]
17:00:56     INFO -      rbx = 0x000000000b208550   r12 = 0x000007fc8a89e910
17:00:56     INFO -      r13 = 0x0000000000000000   r14 = 0x00000000004bf078
17:00:56     INFO -      r15 = 0x00007fffffffffff   rip = 0x000007fc8a9ea936
17:00:56     INFO -      rsp = 0x00000000004bf040   rbp = 0x000000000b208550
17:00:56     INFO -      Found by: call frame info

with an occasional:

17:39:06     INFO -  CPU: amd64
17:39:06     INFO -       family 6 model 30 stepping 5
17:39:06     INFO -       8 CPUs
17:39:06     INFO -  Crash reason:  EXCEPTION_ACCESS_VIOLATION_WRITE
17:39:06     INFO -  Crash address: 0x24a366e
17:39:06     INFO -  Thread 0 (crashed)
17:39:06     INFO -   0  mozjs.dll!js::jit::JitRuntime::patchIonBackedges(JSRuntime *,js::jit::JitRuntime::BackedgeTarget) [Ion.cpp:464c8b5a1f28 : 443 + 0x4f]
17:39:06     INFO -      rbx = 0x0000000018b23c88   r12 = 0x0000000000000008
17:39:06     INFO -      r13 = 0x00000000025e16b8   r14 = 0x0000000000000000
17:39:06     INFO -      r15 = 0x0000000000000000   rip = 0x000007fbcce57158
17:39:06     INFO -      rsp = 0x0000000000849a40   rbp = 0x00000000024a3672
17:39:06     INFO -      Found by: given as instruction pointer in context
17:39:06     INFO -   1  mozjs.dll!js::jit::InterruptCheck(JSContext *) [VMFunctions.cpp:464c8b5a1f28 : 577 + 0x14]
17:39:06     INFO -      rbx = 0x0000000018b23c88   r12 = 0x0000000000000008
17:39:06     INFO -      r13 = 0x00000000025e16b8   r14 = 0x0000000000000000
17:39:06     INFO -      r15 = 0x0000000000000000   rip = 0x000007fbccf898de
17:39:06     INFO -      rsp = 0x0000000000849aa0   rbp = 0x00000000024a3672
17:39:06     INFO -      Found by: call frame info
17:39:06     INFO -   2  0xd3b497
17:39:06     INFO -      rbx = 0x0000000018b23c88   r12 = 0x0000000000000008
17:39:06     INFO -      r13 = 0x00000000025e16b8   r14 = 0x0000000000000000
17:39:06     INFO -      r15 = 0x0000000000000000   rip = 0x0000000000d3b498
17:39:06     INFO -      rsp = 0x0000000000849ad0   rbp = 0x00000000024a3672

I'll poke at this some more tomorrow.  If anybody has VMWare images of a 64-bit Win7 machine with full mozilla+win64 dev environment installed, that'd be fantastic, too.
FWIW, if you already have a Win64 machine with Visual Studio 2010/2013 installed on it (and Windows SDK 8.1), getting a working setup should be as simple as installing the latest MozillaBuild, opening the appropriate start-shell-msvc-*-x64.bat and adding

ac_add_options --target=x86_64-pc-mingw32
ac_add_options --host=x86_64-pc-mingw32

to your .mozconfig (admittedly I didn't test it from scratch, but I only needed the mozconfig change to make it work locally).
I spent some more time with the exception dispatcher, and what I'm seeing is that VectoredContinueHandlers still get called if a VectoredExceptionHandler returns EXCEPTION_CONTINUE_EXECUTION. AsmJSExceptionHandler is working fine, but the Breakpad code still thinks it needs to write a dump and kill the process.

(Note that this is not a problem with SEH: a successful SEH will prevent VCH, but a successful VEH will not prevent VCH. Sheesh, this is messy.)

Maybe we need some kind of coordination so that we don't write a dump for exceptions that AsmJSExceptionHandler already handled?
(Reporter)

Comment 26

5 years ago
Wow, that is a mess.  If there is no way to have VEH prevent the VCH, then I think you're right, the VEH would notify breakpad when it continues execution.

What worries me is the report in comment 23 that VCH aren't always called; it'd be scary to silently drop crash reports.  Since VEH seem fairly solid, I wonder if we could:
 - register breakpad as both a VEH and the UEF
 - when the breakpad VEH is called, we don't want to dump yet if there is a live __try, so we CONTINUE_SEARCH from the VEH and rely on the assertion that we only use __try when there isn't JIT code on the stack (so we know we'll get to the UEF)

So how do we figure out if there are any live __try's?  I believe there is a stable ABI for SEH so I assume we can find the symbol for the head of the linked list.
(In reply to Luke Wagner [:luke] from comment #26)
> Wow, that is a mess.  If there is no way to have VEH prevent the VCH, then I
> think you're right, the VEH would notify breakpad when it continues
> execution.
> 
> What worries me is the report in comment 23 that VCH aren't always called;
> it'd be scary to silently drop crash reports.  Since VEH seem fairly solid,
> I wonder if we could:
>  - register breakpad as both a VEH and the UEF

I am confused on this point.  I thought part of the point of this bug is that the UEF doesn't work, because SEH doesn't work with the JIT code, per comment 0.  So we needed to find an alternative that doesn't rely on the UEF being called.

>  - when the breakpad VEH is called, we don't want to dump yet if there is a
> live __try, so we CONTINUE_SEARCH from the VEH and rely on the assertion
> that we only use __try when there isn't JIT code on the stack (so we know
> we'll get to the UEF)

Ah, so I guess this addresses my earlier point.  To spell out what I understand you to be saying:

// This handler needs to be added as the last handler, not the first one, since it
// wants to give asm.js's handler a chance to fixup any exceptions first.
BreakpadVEH(...) {
  if (!SEHHandlersOnStack) {
    // Something went horribly wrong in JIT code, or in C++ code that's not
    // prepared to handle the exception.  We can't rely on SEH and the UEF
    // to write out crash reports, because we might be in JIT code, and the
    // SEH machinery will wander off into the weeds.
    //
    // Write crash report and bail.
    ...
  } else {
    // We have SEH __try somewhere, so assume that somebody is going to handle
    // the exception...and if not, the UEF will be called.  We can have
    // confidence that JIT'd code that doesn't register itself with the unwind
    // machinery isn't running, because magic, and therefore the SEH machinery
    // won't walk off into the weeds.
    //
    // Fall through and indicate somebody else should handle this error.
  }

  // In the first case above, we want the search to continue because we want it
  // to arrive at some sort of reasonable default which winds up killing the
  // process.  In the second case above, we want the search to continue because
  // some SEH handler needs to be called, and possibly the UEF.
  return EXCEPTION_CONTINUE_SEARCH;
}

Is that correct?  Sorry for being dense here.

I am not totally convinced that Breakpad upstream would be willing to accept something like that, as it seems dependent on a number of assumptions...and let's face it, we only have to do something like this because the JS engine doesn't follow the platform's ABI.  (I think it would not be awful to make the JS engine use something like RtlInstallFunctionTableCallback http://msdn.microsoft.com/en-us/library/windows/desktop/ms680595%28v=vs.85%29.aspx ...at least assuming the MS runtime doesn't cache entries delivered by the callback function.  It's not obvious how one would invalidate those entries, though...)

> So how do we figure out if there are any live __try's?  I believe there is a
> stable ABI for SEH so I assume we can find the symbol for the head of the
> linked list.

From skimming various Windows exploit guides, it looks as though we can grovel in fs:[0x0] for whether there is an active handler or not.
Flags: needinfo?(luke)
(Reporter)

Comment 28

5 years ago
(In reply to Nathan Froyd (:froydnj) from comment #27)
> Is that correct?  Sorry for being dense here.

That's right.

We could also release-mode assert when we are about to enter JIT code that !SEHHandlersOnStack and make this a hard guarantee.

> I am not totally convinced that Breakpad upstream would be willing to accept
> something like that, as it seems dependent on a number of assumptions...and
> let's face it, we only have to do something like this because the JS engine
> doesn't follow the platform's ABI.

I don't think any JIT does that (no hits in v8/chromium source for RtlInstallFunctionTableCallback), so it would make sense as a general feature of breakpad.  It'd be interesting to ask the Google breakpad people what they did for the Win64 Chrome release.

> (I think it would not be awful to make
> the JS engine use something like RtlInstallFunctionTableCallback
> http://msdn.microsoft.com/en-us/library/windows/desktop/ms680595%28v=vs.
> 85%29.aspx ...at least assuming the MS runtime doesn't cache entries
> delivered by the callback function.  It's not obvious how one would
> invalidate those entries, though...)

This isn't just a matter of registering "I have jit code at this pc range".  Rather, since we don't maintain a frame pointer, we'd need to be able to determine, for any pc, the offset from sp to the base of the frame.  That's really difficult in the face of all the weird things the JIT does with sp.  It's also bug prone so not something you want on your crash-handling path.

I actually started down the path of maintaining stack depth info and backed away because I could see it was going to be a huge project.  Instead, we maintain (only when profiling is active) a virtual frame pointer at a fixed memory location that gets maintained by profiling prologue/epilogues.  Kannan revisited this question in the context of general Ion and came to the same conclusion.

> > So how do we figure out if there are any live __try's?  I believe there is a
> > stable ABI for SEH so I assume we can find the symbol for the head of the
> > linked list.
> 
> From skimming various Windows exploit guides, it looks as though we can
> grovel in fs:[0x0] for whether there is an active handler or not.

Great!  To confirm, you could just compile a test case with __try/__finally and step through the prologue/epilogue.
Flags: needinfo?(luke)
(Reporter)

Comment 29

5 years ago
(In reply to Nathan Froyd (:froydnj) from comment #27)
> >  - register breakpad as both a VEH and the UEF
> 
> I am confused on this point.  I thought part of the point of this bug is
> that the UEF doesn't work, because SEH doesn't work with the JIT code, per
> comment 0.  So we needed to find an alternative that doesn't rely on the UEF
> being called.

Just to be clear: the VEH would catch crashes inside JIT code; the UEF catches them when there is no JIT code on the stack but frame-based EH is used.
(In reply to Nathan Froyd (:froydnj) from comment #27)
> From skimming various Windows exploit guides, it looks as though we can
> grovel in fs:[0x0] for whether there is an active handler or not.

That doesn't work on 64-bit Windows, methinks.
(In reply to Aaron Klotz [:aklotz] from comment #30)
> (In reply to Nathan Froyd (:froydnj) from comment #27)
> > From skimming various Windows exploit guides, it looks as though we can
> > grovel in fs:[0x0] for whether there is an active handler or not.
> 
> That doesn't work on 64-bit Windows, methinks.

Good point!  Wikipedia says the 64-bit version is at gs:[0x0].

Light testing on x86 suggests that for functions with __try, the SEH region seems to cover the whole function (!); maybe things work as one might expect for nested __try, or perhaps I am doing something wrong.  The same light testing indicates that there is at least one SEH entry pushed prior to entering main(), so Breakpad will likely have to stash away the top-of-SEH-stack when the ExceptionHandler is created, and use that saved value for checking whether there is an active SEH in progress.

Still in process of setting up a proper 64-bit environment.  Hooray for VS 2010 Express not having 64-bit stuff (or at least I have not convinced it to install such), and 2013 Express only running on Windows 8.

Comment 32

5 years ago
(In reply to Nathan Froyd (:froydnj) from comment #31)
> Still in process of setting up a proper 64-bit environment.  Hooray for VS
> 2010 Express not having 64-bit stuff (or at least I have not convinced it to
> install such), and 2013 Express only running on Windows 8.
VS2013 Express for Windows *Desktop* runs on Win7. Firefox Metro is dead, right?

Comment 33

5 years ago
(In reply to Luke Wagner [:luke] from comment #28)
> This isn't just a matter of registering "I have jit code at this pc range". 
> Rather, since we don't maintain a frame pointer, we'd need to be able to
> determine, for any pc, the offset from sp to the base of the frame.  That's
> really difficult in the face of all the weird things the JIT does with sp. 
> It's also bug prone so not something you want on your crash-handling path.
FYI, IE11's jscript9.dll does maintain a frame pointer.
(In reply to Yuhong Bao from comment #32)
> (In reply to Nathan Froyd (:froydnj) from comment #31)
> > Still in process of setting up a proper 64-bit environment.  Hooray for VS
> > 2010 Express not having 64-bit stuff (or at least I have not convinced it to
> > install such), and 2013 Express only running on Windows 8.
> VS2013 Express for Windows *Desktop* runs on Win7.

Ah, indeed!  I didn't pick up the right link for the download.  Thanks for pointing this out.
> What worries me is the report in comment 23 that VCH aren't always called;

That doesn't seem like something we need to worry about. My read of that example is: on Win7, if your UEF says CONTINUE_SEARCH, then the OS intervenes and kills the process before VCH. In our case, either our UEF would itself kill the process, or Windows can't figure out how to call the UEF so we hit the VCH.
(Reporter)

Comment 36

5 years ago
Oh, well that sounds promising :)  Hopefully we could get QA to test all all the Windows we know about.
(In reply to Nathan Froyd (:froydnj) from comment #27)
> (In reply to Luke Wagner [:luke] from comment #26)
> > Wow, that is a mess.  If there is no way to have VEH prevent the VCH, then I
> > think you're right, the VEH would notify breakpad when it continues
> > execution.
> > 
> > What worries me is the report in comment 23 that VCH aren't always called;
> > it'd be scary to silently drop crash reports.  Since VEH seem fairly solid,
> > I wonder if we could:
> >  - register breakpad as both a VEH and the UEF
> 
> I am confused on this point.  I thought part of the point of this bug is
> that the UEF doesn't work, because SEH doesn't work with the JIT code, per
> comment 0.  So we needed to find an alternative that doesn't rely on the UEF
> being called.
> 
> >  - when the breakpad VEH is called, we don't want to dump yet if there is a
> > live __try, so we CONTINUE_SEARCH from the VEH and rely on the assertion
> > that we only use __try when there isn't JIT code on the stack (so we know
> > we'll get to the UEF)
> 
> Ah, so I guess this addresses my earlier point.  To spell out what I
> understand you to be saying:
> 
> // This handler needs to be added as the last handler, not the first one,
> since it
> // wants to give asm.js's handler a chance to fixup any exceptions first.
> BreakpadVEH(...) {
>   if (!SEHHandlersOnStack) {

Turning this into code appears to make the test in bug 1024272 pass, but I get a "Nightly has stopped working" dialog popping up in the middle of the tests.  So that's highly suboptimal.

Running the full directory of crashreporter xpcshell tests brings up a bunch of those dialogs, as well, I assume one for every test.

The proposed patch also appears to make toolkit\crashreporter\test\unit_ipc\test_content_annotation.js crash reliably, so that's unfortunate.
(In reply to Nathan Froyd (:froydnj) from comment #37)
> The proposed patch also appears to make
> toolkit\crashreporter\test\unit_ipc\test_content_annotation.js crash
> reliably, so that's unfortunate.

AFAICS, the crash happens as expected, the crashreport file just doesn't contain the expected annotations, which seems weird to me.
This gs register stuff doesn't seem reliable to me. Isn't the core of this bug that win64 represents SEH handlers with tables rather than chained pointers?
After a bit more thought, David's right; the SEH stuff is all table-based.  We can determine whether any function on the call stack has an SEH handler--just like the OS does it--but that idea fails because we can't completely unwind the call stack due to the potential presence of JIT functions.  So there might be SEH handlers above the JIT code in the call stack--even though we're not supposed to do that--that we can't find out about.

All of which prompts the gross hack perpetrated in this patch.

I tried (verbosely) describing the idea in the comments; the main idea is to unwind the stack ourselves in a vectored exception handler (that executes after the asm.js one--though I haven't tested this with asm.js code yet...) to determine whether the SEH handling is going to wander off into the weeds due to JIT code.  If we determine that weed-wandering is going to take place, we force a crash dump.  If everything looks OK (i.e. we can unwind all the way back to the beginning of the call stack), then nothing happens.

The good news: all the crashreporter tests pass with this patch.

The bad news: test_crash_AsyncShutdown.js still pops up a "check online/close/debug" dialog on my Windows 7 box (clicking the "close" option makes the test pass...strangely, so does "debug"; I guess something in the test eventually times out and it goes looking for the crash dump?).  I am unsure of how to make this dialog not appear.  (We do want to avoid this dialog appearing, correct?)  I am assuming the sequence of events here is because we trigger nested exceptions (the crash the test invokes and then the crash inside the exception handling machinery) and that causes unclean shutdown, which causes the dialog to pop up.  Feedback on that assumption and/or ideas on how to force a clean shutdown welcome.

Feedback on the patch welcome as well.  The other gross hack that could go on top of this one is that when we determine that we have JIT code on the stack, we register fake unwind info for that JIT code with an exception handler that terminates the process--presumably that counts as a cleaner shutdown.  Then the normal SEH machinery will find our fake unwind information and terminate the process for us!  Better hacks than that welcome.
Attachment #8493973 - Flags: feedback?(dmajor)
Attachment #8493973 - Flags: feedback?(benjamin)
Can we really not just do the right thing and emit unwind information for JIT frames?
Flags: needinfo?(luke)

Comment 42

5 years ago
(In reply to Benjamin Smedberg  [:bsmedberg] from comment #41)
> Can we really not just do the right thing and emit unwind information for
> JIT frames?

Yea, what would be the performance cost of using a frame pointer register on x64?
(In reply to Benjamin Smedberg  [:bsmedberg] from comment #41)
> Can we really not just do the right thing and emit unwind information for
> JIT frames?

Luke addressed this in comment 28.  Two JS JIT hackers have looked at this and said, "wow, that's hard".  Maybe we could do the right thing, but not for several months or more.

(In reply to Yuhong Bao from comment #42)
> Yea, what would be the performance cost of using a frame pointer register on
> x64?

Doesn't matter whether you use a frame pointer register on x64; the OS unwinds stack frames without using the frame pointer.
Flags: needinfo?(luke)

Comment 44

5 years ago
(In reply to Nathan Froyd (:froydnj) from comment #43)
> (In reply to Yuhong Bao from comment #42)
> > Yea, what would be the performance cost of using a frame pointer register on
> > x64?
> 
> Doesn't matter whether you use a frame pointer register on x64; the OS
> unwinds stack frames without using the frame pointer.
It will use a frame pointer if you tell it to:
http://msdn.microsoft.com/en-us/library/ms235231.aspx
(In reply to Yuhong Bao from comment #44)
> (In reply to Nathan Froyd (:froydnj) from comment #43)
> > (In reply to Yuhong Bao from comment #42)
> > > Yea, what would be the performance cost of using a frame pointer register on
> > > x64?
> > 
> > Doesn't matter whether you use a frame pointer register on x64; the OS
> > unwinds stack frames without using the frame pointer.
> It will use a frame pointer if you tell it to:
> http://msdn.microsoft.com/en-us/library/ms235231.aspx

Sure, you can use the frame pointer as a frame pointer register.  The OS doesn't care how you use the frame pointer register, though; the OS cares about the unwind information you attach to every function.  And generating that unwind information is not straightforward.  Plus you have to ensure that the JIT code follows all the rules described here:

http://msdn.microsoft.com/en-us/library/tawsa7cb.aspx

along with several other things I'm not listing here.

The change is not as simple as removing rbp from the pool of allocatable registers and adding |push rbp| to the prologue of every JIT-generated function.

Comment 46

5 years ago
(In reply to Nathan Froyd (:froydnj) from comment #45)
> (In reply to Yuhong Bao from comment #44)
> > (In reply to Nathan Froyd (:froydnj) from comment #43)
> > > (In reply to Yuhong Bao from comment #42)
> > > > Yea, what would be the performance cost of using a frame pointer register on
> > > > x64?
> > > 
> > > Doesn't matter whether you use a frame pointer register on x64; the OS
> > > unwinds stack frames without using the frame pointer.
> > It will use a frame pointer if you tell it to:
> > http://msdn.microsoft.com/en-us/library/ms235231.aspx
> 
> Sure, you can use the frame pointer as a frame pointer register.  The OS
> doesn't care how you use the frame pointer register, though; the OS cares
> about the unwind information you attach to every function.  And generating
> that unwind information is not straightforward.  Plus you have to ensure
> that the JIT code follows all the rules described here:
> 
> http://msdn.microsoft.com/en-us/library/tawsa7cb.aspx
> 
> along with several other things I'm not listing here.
> 
> The change is not as simple as removing rbp from the pool of allocatable
> registers and adding |push rbp| to the prologue of every JIT-generated
> function.

I know, the point is that if you tell it in the unwind information to use a frame pointer register, they will use it during unwind. (read the sample code at the bottom)
(In reply to Yuhong Bao from comment #46)
> I know, the point is that if you tell it in the unwind information to use a
> frame pointer register, they will use it during unwind. (read the sample
> code at the bottom)

Yes, of course.  But that's completely irrelevant to the discussion here.
(In reply to Nathan Froyd (:froydnj) from comment #47)
> (In reply to Yuhong Bao from comment #46)
> > I know, the point is that if you tell it in the unwind information to use a
> > frame pointer register, they will use it during unwind. (read the sample
> > code at the bottom)
> 
> Yes, of course.  But that's completely irrelevant to the discussion here.

My apologies, that was unnecessarily harsh.

We could certainly see what removing a register from the pool of allocatable registers--simulating a frame pointer--would do, performance and codesize-wise.

But my impression is that the real question is not "do we move forward with generating unwind information with or without frame pointers?" but "do we move forward with generating unwind information at all?"  The first question would arise naturally in the implementation of generating unwind information, if we decided to do it.
(Reporter)

Comment 49

5 years ago
(In reply to Benjamin Smedberg  [:bsmedberg] from comment #41)
> Can we really not just do the right thing and emit unwind information for
> JIT frames?

Yes, see comment 28.

Nathan: what about the much simpler hack dmajor hinted at in comment 25:
If we can depend on the VCH running always (that is, dmajor suggests there aren't bugs in VCH handling in comment 35), then we set a flag in AsmJSFaultHandler saying "I handled this one" and the VCH checks this flag, allowing execution to continue w/o dump if set (clearing afterwards).

The only question is whether we can receive faults while handling faults... I expect not; but we'd need to check.
(In reply to Luke Wagner [:luke] from comment #49)
> Nathan: what about the much simpler hack dmajor hinted at in comment 25:
> If we can depend on the VCH running always (that is, dmajor suggests there
> aren't bugs in VCH handling in comment 35), then we set a flag in
> AsmJSFaultHandler saying "I handled this one" and the VCH checks this flag,
> allowing execution to continue w/o dump if set (clearing afterwards).

I don't think this works, because of the crashreporter tests; we still want to do crash dumps for actual faults that aren't handled by asm.js, but if any JIT code is on the stack when those faults happen, then we lose because SEH can't unwind through the JIT code.  I guess it's possible that VCHs still execute if SEH handling goes awry; we'd have to check, but I would be surprised if they did.

> The only question is whether we can receive faults while handling faults...
> I expect not; but we'd need to check.

Pretty sure not, since that's what happens when the SEH unwinding wanders off.
(Reporter)

Comment 51

5 years ago
(In reply to Nathan Froyd (:froydnj) from comment #50)
> I guess it's possible that VCHs
> still execute if SEH handling goes awry; we'd have to check, but I would be
> surprised if they did.

From dmajor has said, VCHs do give us this guarantee.  (In fact the problem is they are called in too *many* situation ;)
(In reply to Luke Wagner [:luke] from comment #51)
> (In reply to Nathan Froyd (:froydnj) from comment #50)
> > I guess it's possible that VCHs
> > still execute if SEH handling goes awry; we'd have to check, but I would be
> > surprised if they did.
> 
> From dmajor has said, VCHs do give us this guarantee.  (In fact the problem
> is they are called in too *many* situation ;)

Ah, indeed he did say that in comment 5.  Too many things to keep straight.

I like the self-containedness of the breakpad vectored exception solution, but I guess we have various other bits of the JS engine that reach into the crash reporter, sooo...
This patch implements Luke/dmajor's suggestion of catching things in a vectored continue handler, but requiring asm.js to set a flag indicating whether it handled the exception in its vectored exception handler.  Passes crashreporter tests, though I still have to click through the dialog for closing the crashes triggered by test_crash_AsyncShutdown.js.
Attachment #8494539 - Flags: feedback?(luke)
Attachment #8494539 - Flags: feedback?(benjamin)
Better patch that contains all the js/ bits.
Attachment #8494539 - Attachment is obsolete: true
Attachment #8494539 - Flags: feedback?(luke)
Attachment #8494539 - Flags: feedback?(benjamin)
Attachment #8494541 - Flags: feedback?(luke)
Attachment #8494541 - Flags: feedback?(benjamin)
(Reporter)

Comment 55

5 years ago
Comment on attachment 8494541 [details] [diff] [review]
dump crash reports in a vectored continue handler

Review of attachment 8494541 [details] [diff] [review]:
-----------------------------------------------------------------

Looks about right, nice work!

::: js/src/asmjs/AsmJSSignalHandlers.cpp
@@ +519,5 @@
> +    if (HandleException(exception)) {
> +        // We need to inform other places, notably the crash reporter, that
> +        // this exception has been handled.  These other places are
> +        // responsible for clearing this flag when they are done.
> +        AsmJSHandledException = true;

Perhaps:
  if (AsmJSHandledException)
      MOZ_CRASH("AsmJSHandledException");
The reason being that, if by some bizarre bug, the flag was set and then failed to be reset, then if a real crash occured, the VCH would (incorrectly) see the flag set, return CONTINUE_EXECUTION, this would *re*fault, and then we'd get here with the flag set and MOZ_CRASH() (which, we should verify, would actually crash for real :).  Now, the current VCH returns CONTINUE_SEARCH; maybe we can change that?
Attachment #8494541 - Flags: feedback?(luke) → feedback+
Comment on attachment 8493973 [details] [diff] [review]
make breakpad reliably dump crash reports in the presence of the JIT

I'm happy to let dmajor handle the feedback here. He'll need to work with Ted to get the breakpad changes upstream.
Attachment #8493973 - Flags: feedback?(benjamin)

Updated

5 years ago
Attachment #8494541 - Flags: feedback?(benjamin)
(Reporter)

Comment 57

5 years ago
I'd be interested to hear from the Google breakpad people what Chrome did on Win64 with their JIT.
From an email thread today, they have done nothing: mento linked us to https://code.google.com/p/chromium/issues/detail?id=380240#c4
Looks like the patch is not quite perfect:

https://tbpl.mozilla.org/?tree=Date&rev=19b32738b1bb

(And that's after I fixed up some silly Win64 things)

M1 crashes (asmjscache test, even), M3 unexpected crash dumps, devtools failures, and crashtests.  Lovely.  But xpcshell tests are green! =D
(Reporter)

Comment 60

5 years ago
The asm.js/testCall.js JIT failures are likely bug 1071444 (unrelated) and fixed with 1071444 (which doesn't appear to be on Date).  Perhaps re-pull and try again?
Some corner cases to think about:

Nested exceptions: Breakpad's AutoExceptionHandler solves this by removing itself from UEF [1]. I don't know what the VEH/VCH equivalent of this would be.

Parallel exceptions on different threads: These might stomp over the global. Maybe this is rare enough that we don't care. I guess we can only write one dump anyway. Just want to make sure we don't get false positives.

Other apps setting up a VEH in front of ours: Given Win64's current nightly-only status, I really doubt anyone does this.

In any case, as long as we avoid reporting spurious crashes, then anything will be an improvement over the situation today, even if we miss a few.

Also I really want to figure out why you're getting a Windows fail dialog (which I think is the equivalent of my postmortem debugger at comment 6). I'll try to look into this some more.

[1] http://hg.mozilla.org/mozilla-central/file/5e704397529b/toolkit/crashreporter/google-breakpad/src/client/windows/handler/exception_handler.cc#l419
(Reporter)

Comment 62

5 years ago
Oops, good catch; I think we really need to use TLS for the 'handled' flag.  This could cause spurious crashes with the wrong interleaving.

When can we get a nested exception?  If we fault during breakpad (or AsmJSFaultHandler)?  One thing we could do is have AsmJSFaultHandler crash if the flag is already set.  This would also be useful for catching bugs.

As for the fail dialog, I wonder if that might go away if the VCH returned CONTINUE_EXECUTION instead of CONTINUE_SEARCH as it does in the patch.
> When can we get a nested exception?  If we fault during breakpad (or
> AsmJSFaultHandler)?

Breakpad, yes. Or one-off AsmJS failures. If the fault happens repeatedly in the AsmJS handler then we overflow the stack like bug 1058168.
(Reporter)

Comment 64

5 years ago
(In reply to David Major [:dmajor] from comment #63)
> Breakpad, yes. Or one-off AsmJS failures. If the fault happens repeatedly in
> the AsmJS handler then we overflow the stack like bug 1058168.

In general, we have the 'handlingSignal' flag that is supposed to catch that.  The problem in bug 1058168 was that we were faulting while getting to this flag!
> Also I really want to figure out why you're getting a Windows fail dialog
> (which I think is the equivalent of my postmortem debugger at comment 6).
> I'll try to look into this some more.

Aha, so the culprit is MOZ_CRASHREPORTER_NO_REPORT [1], which some tests use [2]. That environment variable means the crash reporter is active and writes .dmp files, but does not launch the reporter app. In code terms it means that MinidumpCallback returns before it reaches TerminateProcess [3].

According to [4] a UEF can return EXCEPTION_EXECUTE_HANDLER to indicate that nothing else needs to be done -- which I guess means no sadface dialog. Since this is not a valid return value for a VCH (I tried anyway!) I guess we need to terminate manually. I made this change to the end of the VCH and test_crash_AsyncShutdown.js passes without popping up my debugger:

   (void) current_handler->AttemptToWriteCrashReport(exinfo)
+  TerminateProcess(GetCurrentProcess(), 1);
   return EXCEPTION_CONTINUE_SEARCH;

(It was a quick and dirty test. In reality we might want to be conditional on the return value of AttemptToWriteCrashReport)

[1] http://hg.mozilla.org/mozilla-central/file/5e704397529b/toolkit/crashreporter/nsExceptionHandler.cpp#l1003
[2] http://hg.mozilla.org/mozilla-central/file/5e704397529b/testing/xpcshell/runxpcshelltests.py#l859
[3] http://hg.mozilla.org/mozilla-central/file/5e704397529b/toolkit/crashreporter/nsExceptionHandler.cpp#l809
[4] http://hg.mozilla.org/mozilla-central/file/5e704397529b/toolkit/crashreporter/google-breakpad/src/client/windows/handler/exception_handler.cc#l481
Also, to get the crashtests passing I made the VCH opt out of exception 0x406D1388, which is the secret handshake for setting a thread name. I suspect it would fix some of the mochitests as well.

http://msdn.microsoft.com/en-us/library/xcb2z8hs.aspx
http://dxr.mozilla.org/mozilla-central/search?q=0x406D1388&case=false
(In reply to David Major [:dmajor] from comment #65)
> Aha, so the culprit is MOZ_CRASHREPORTER_NO_REPORT [1], which some tests use
> [2]. That environment variable means the crash reporter is active and writes
> .dmp files, but does not launch the reporter app. In code terms it means
> that MinidumpCallback returns before it reaches TerminateProcess [3].

All of our automated testing uses that, FYI (so that we don't get the crash reporter dialog). It seems reasonable to stick a TerminateProcess in the noReport case:
http://hg.mozilla.org/mozilla-central/annotate/32acbe1d64dc/toolkit/crashreporter/nsExceptionHandler.cpp#l809
(In reply to David Major [:dmajor] from comment #66)
> Also, to get the crashtests passing I made the VCH opt out of exception
> 0x406D1388, which is the secret handshake for setting a thread name. I
> suspect it would fix some of the mochitests as well.
> 
> http://msdn.microsoft.com/en-us/library/xcb2z8hs.aspx
> http://dxr.mozilla.org/mozilla-central/search?q=0x406D1388&case=false

Words fail me.  (Shouldn't that exception be handled by SEH and therefore the VCH should never see it?!)

Anyway, you are correct, filtering this exception makes tests happy:

https://tbpl.mozilla.org/?tree=Date&rev=12ee3a176bfe

Still need to think about the corner cases raised in comment 61, but that push looks pretty green.  I'm a little nervous about downgrading the MOZ_ASSERT in CrashReporter::RegisterExceptionHandlingFlag to an existence check, but the assertion was firing during packaging, for reasons that I haven't looked into closely.

I tried pushing the more complicated approach in attachment 8493973 [details] [diff] [review], but that did not seem to go so well:

https://tbpl.mozilla.org/?tree=Date&rev=7324db521e0d

I'm not sure what's going wrong here, though I would not be *completely* surprised if there were a few functions in Microsoft system libraries that didn't have unwind information in them, and the default return-address-at-rsp following works just fine.  (Even system implementers sometimes forget their own rules about their system.)  But our system doesn't know about that, and falls over.  I think the stack unwinding approach is more robust and more self-contained (which might make it more acceptable upstream), but I'm not sure I have the patience to figure out what's causing those test failures in the above push.

Comment 69

5 years ago
(In reply to Nathan Froyd (:froydnj) from comment #68)
> I'm not sure what's going wrong here, though I would not be *completely*
> surprised if there were a few functions in Microsoft system libraries that
> didn't have unwind information in them, and the default
> return-address-at-rsp following works just fine. 

I think it is officially supported if it is a "leaf" function.
> I tried pushing the more complicated approach in attachment 8493973 [details] [diff] [review]
> [details] [diff] [review], but that did not seem to go so well:
> I think the stack unwinding
> approach is more robust and more self-contained (which might make it more
> acceptable upstream), but I'm not sure I have the patience to figure out
> what's causing those test failures in the above push.

Conceptually I like your unwinding patch, even though the guts are kind of frightening. It doesn't require faraway code to be in on the scheme.

Looking at the stacks in the test logs, I think the code is too eager to write a dump. We should write a dump only if the exception is still not handled by the time the unwind reaches the JIT code. Determining that sounds hard. We can't just use the presence or absence of other handlers, since they may or may not want the exception.

Crazy idea: What if we use RtlAddFunctionTable and give it mostly-bogus data? All we need is the "I have an exception handler" flag, then we could write a dump and terminate before the unwinder has a chance to go astray.
(In reply to David Major [:dmajor] from comment #70)
> Looking at the stacks in the test logs, I think the code is too eager to
> write a dump. We should write a dump only if the exception is still not
> handled by the time the unwind reaches the JIT code. Determining that sounds
> hard. We can't just use the presence or absence of other handlers, since
> they may or may not want the exception.

Oh, good point.  I suppose one could easily wind up with:

  [JIT code]
    ...
      Create a new thread with a worker
      Set the thread name

and the exception would be handled by the thread-name setting code, and never get to the JIT code.

> Crazy idea: What if we use RtlAddFunctionTable and give it mostly-bogus
> data? All we need is the "I have an exception handler" flag, then we could
> write a dump and terminate before the unwinder has a chance to go astray.

Yeah, that's basically what I was suggesting in the last part of comment 40.  But your observation above indicates there are two different cases to handle here:

1. Call RtlAddFunctionTable, and then reach the newly added handler, which terminates the program cleanly;
2. Call RtlAddFunctionTable, and then not reach the newly added handler, because somebody else handled the error below us.

We need to call RtlDeleteFunctionTable in the second case, I think, both for our own peace of mind later on, and because we might eventually wind up with multiple function tables for a piece of the address space, which would break things in interesting ways.  But we don't call VCHs after successful SEH unwinds, so I don't think there's a good place to stick that function call in.  Deleting pre-existing function tables on entrance to our VEH could be done, although I'm not sure how threading complicates the issue there...

Would it be feasible to just push off the responsibility of installing "dummy" function tables to the JIT/GC instead?  The JS engine can simply add function tables for large blocks of memory whose handlers do nothing but terminate the process.  And those tables can be deleted and re-added as necessary during GC--assuming we move large blocks of compiled code during GC, I'm ignorant on how the particular details of the JIT/GC work here.  Then we wouldn't even need any breakpad modifications, since SEH would always find dummy handlers for JIT code.  Luke, how feasible is that idea?
Flags: needinfo?(luke)
(Reporter)

Comment 72

5 years ago
Iiiinteresting.  Let me see if I understand what the idea here:
 - If a structured exception reaches JIT code, there is no hope; we assume that JIT code is never run from within a __try block so there can be no saving us now.
 - So we simply register a handler for JIT code that does the same thing as the UEF

If so, that sounds great; it feels like as close to the "right thing" as we can do without doing the right thing :)  I'm guessing we'd have a new JSAPI callback that gets called when new code is generated, before it is executed.  Breakpad would implement this hook (registering it for all JSRuntimes) and use the callback to call RtlAddFunctionTable.  Sound right?
Flags: needinfo?(luke)
(In reply to Luke Wagner [:luke] from comment #72)
> Iiiinteresting.  Let me see if I understand what the idea here:
>  - If a structured exception reaches JIT code, there is no hope; we assume
> that JIT code is never run from within a __try block so there can be no
> saving us now.
>  - So we simply register a handler for JIT code that does the same thing as
> the UEF

Yup, that's the idea.

> If so, that sounds great; it feels like as close to the "right thing" as we
> can do without doing the right thing :)  I'm guessing we'd have a new JSAPI
> callback that gets called when new code is generated, before it is executed.
> Breakpad would implement this hook (registering it for all JSRuntimes) and
> use the callback to call RtlAddFunctionTable.  Sound right?

I think Breakpad would need to know when that code is GC'd, too, so it could call RtlDeleteFunctionTable, and that seems somewhat heavyweight.  I was envisioning something like a hook that gets called when the GC (or the JIT?  not sure who manages the memory devoted to JIT'd functions here) requests new blocks of memory from the OS.  It makes the overhead somewhat less if we're registering tables for regions of MB rather than KB, and we don't care whether the blocks are exclusively devoted to code, or whether there's data mixed in among them.  
And of course we'd need an API for if/when those blocks get released back to the OS as well.

I was thinking that the JS engine could setup the handlers itself, and call a JSRuntime hook for the body of the handler.  The default handler would call TerminateProcess, but XPCJSRuntime would set the hook to call into Breakpad to do crash reporting, then terminate the process.  I'd rather not thread something from the JS engine, through our Breakpad wrapper, to Breakpad itself, if we can avoid it.
Also, who's on point for determining what we do here to actually ship something?  Having one blocking bug for Win64 sit around for two months doesn't seem like a good thing, and it'd be bad if we let this sit around for a couple more months.

We have a solution (comment 68) that turns tests green.  So we can either:

- Fix that up to DTRT with regards to threads and maybe a few other things, but probably have to maintain local changes to Breakpad indefinitely; or

- Write up the untested and somewhat more fiddly scheme described in the last several comments, which might take...oh, I don't know, I'd guess a couple of weeks, and which might have interesting unforeseen consequences.

Benjamin, I'm going to nominate you for making a decision.  WDYT?
Flags: needinfo?(benjamin)
(Reporter)

Comment 75

5 years ago
(In reply to Nathan Froyd (:froydnj) from comment #73)
> I think Breakpad would need to know when that code is GC'd, too, so it could
> call RtlDeleteFunctionTable, and that seems somewhat heavyweight.  I was
> envisioning something like a hook that gets called when the GC (or the JIT? 
> not sure who manages the memory devoted to JIT'd functions here) requests
> new blocks of memory from the OS.  It makes the overhead somewhat less if
> we're registering tables for regions of MB rather than KB, and we don't care
> whether the blocks are exclusively devoted to code, or whether there's data
> mixed in among them.  
> And of course we'd need an API for if/when those blocks get released back to
> the OS as well.

You're right; I didn't actually mean "for each JIT function" above; I was thinking we'd put this at the of OS allocation/release (which has the added benefit that there are only a few pinchpoints for these).

> I was thinking that the JS engine could setup the handlers itself, and call
> a JSRuntime hook for the body of the handler.  The default handler would
> call TerminateProcess, but XPCJSRuntime would set the hook to call into
> Breakpad to do crash reporting, then terminate the process.  I'd rather not
> thread something from the JS engine, through our Breakpad wrapper, to
> Breakpad itself, if we can avoid it.

That works too.  Actually, even simpler, can we just call MOZ_CRASH from our handler (and will that correctly trigger breakpad, even if we are already in a handler)?  I'd hope so.
(Reporter)

Comment 76

5 years ago
(In reply to Nathan Froyd (:froydnj) from comment #74)
I can whip up the patch for what we're discussing here if you can help me test it.
(In reply to Nathan Froyd (:froydnj) from comment #73)
> I was envisioning something like a hook that gets called when the GC requests
> new blocks of memory from the OS.

> And of course we'd need an API for if/when those blocks get released back to
> the OS as well.

The GC allocates chunks at [1] and releases them at [2] (Windows-specific). Some care might need to be taken with the latter, since it can be called during allocation itself, but otherwise this is centralized. I believe OdinMonkey does its own thing, but Luke would know about that.

[1] http://hg.mozilla.org/mozilla-central/annotate/5075aec17a1e/js/src/gc/Memory.cpp#l120
[2] http://hg.mozilla.org/mozilla-central/annotate/5075aec17a1e/js/src/gc/Memory.cpp#l255
(In reply to Luke Wagner [:luke] from comment #76)
> (In reply to Nathan Froyd (:froydnj) from comment #74)
> I can whip up the patch for what we're discussing here if you can help me
> test it.

I will volunteer dmajor for that help too, but sure!
(Reporter)

Updated

5 years ago
Assignee: nobody → luke

Updated

5 years ago
Flags: needinfo?(benjamin)

Comment 79

5 years ago
hey

i've put together a wip patch for v8 that tries to solve this problem here: https://codereview.chromium.org/606013002

basically, it appears to be possible to install an exception handler for the entire memory range we put our jitted code in (which also on 64bit is a continuous range to allow for 32bit relative addressing within code blocks).

the downside is that the crash dumps look really ugly in cdb as it still can't unwind the stack.
(Reporter)

Comment 80

5 years ago
Hah, looks like both teams came to the same conclusion :)  Thanks for linking us to the patch!
Another possibility is RtlInstallFunctionTableCallback to delay the work until unwind time, if you prefer.
> That works too.  Actually, even simpler, can we just call MOZ_CRASH from our
> handler (and will that correctly trigger breakpad, even if we are already in
> a handler)?  I'd hope so.

There would still be JIT on the stack so I suspect we'd fail to walk back to breakpad.

But if you wanted to avoid coupling with breakpad code, you could SetUnhandledExceptionFilter(nullptr) to get back the current UEF, and call it directly.
(Reporter)

Comment 83

5 years ago
(In reply to David Major [:dmajor] from comment #82)
Oh, of course, you're right; and that's a good idea.
Doh, I forgot we have a patched_SetUnhandledExceptionFilter that blocks such things.
We could just add an API to our hook there to hand back the UEF to call.
(Reporter)

Comment 86

5 years ago
Posted patch set-uef (obsolete) — Splinter Review
This patch adds the JS engine parts which are pretty trivial; there are only two places we allocate executable pages.  I realized I don't know enough about breakpad to do the other parts, so perhaps someone else could take this patch and call js::SetUnhandledExceptionFilter (in jsfriendapi.h) early in process initialization (once for the whole process)?

RtlInstallFunctionTableCallback ended up being way simpler than RtlAddFunctionTable because the latter appears to need a custom entry point (at least, that's what the v8 patch did).  My only concern is whether this callback is perhaps called spuriously, in cases where unwinding hasn't actually reached the code on the stack?  I did a bit of quick testing locally and it seems to work...
(Reporter)

Comment 87

5 years ago
(assigning this to myself was a bit overzealous)
Assignee: luke → nobody
> This patch adds the JS engine parts which are pretty trivial; there are only
> two places we allocate executable pages.  I realized I don't know enough
> about breakpad to do the other parts, so perhaps someone else could take
> this patch and call js::SetUnhandledExceptionFilter (in jsfriendapi.h) early
> in process initialization (once for the whole process)?

We can do this easily in nsExceptionHandler.cpp (technically not part of Breakpad, so no upstream to worry about) -- I'll attach a patch.

> RtlInstallFunctionTableCallback ended up being way simpler than
> RtlAddFunctionTable because the latter appears to need a custom entry point
> (at least, that's what the v8 patch did).

The signature for a UEF is:
typedef LONG (WINAPI *PTOP_LEVEL_EXCEPTION_FILTER)(
    _In_ struct _EXCEPTION_POINTERS *ExceptionInfo
    );

Ideally we'd want Windows to call it along with the current EXCEPTION_POINTERS so that Breakpad can write the relevant information to the dump. So I think the callback should return a RUNTIME_FUNCTION that points to the UEF. Well, indirectly at least...

The RUNTIME_FUNCTION's UnwindInfoAddress member needs to be a 32-bit offset relative to the start of the JIT region. So I guess the JIT region itself would have to contain the UnwindInfo struct, as well as a stub that calls sUnhandledExceptionFilter.

...which is more or less what the v8 patch does. Maybe these hurdles negate the advantage of RtlInstallFunctionTableCallback over RtlAddFunctionTable.

I simulated this approach by clobbering a bunch of values in my debugger, and it seemed to work OK, but I don't know how to code this up 'properly'. Luke can you help with that piece?
Posted patch Tell JS about the UEF (obsolete) — Splinter Review
(Reporter)

Comment 90

5 years ago
Posted patch combined.patch (obsolete) — Splinter Review
Phew, I emulated the v8 patch so we could get the context.  This was mostly easy; what took most of the day was:
 - exceptionHandler is a ULONG, not a word; I'm not sure how v8 works with an Address
 - exception handler's signature isn't the same as UnhandledExceptionFilter; it's totally different (see PEXCEPTION_HANDLER comments in patch)
 - our MacroAssembler isn't quite so easy to point at random code so I had to hardcode the opcodes

How does this look?  It probably doesn't build on Linux and it's a combination of three patches.
Attachment #8497131 - Attachment is obsolete: true
Attachment #8497849 - Flags: feedback?(dmajor)
Comment on attachment 8497849 [details] [diff] [review]
combined.patch

Looks great!

test_crash_AsyncShutdown.js passes, as do various other tests on Date that were problems before. I pushed this to Date for a full test run, Linux build be damned :) https://tbpl.mozilla.org/?tree=Date&rev=274c9ce35ea2
Attachment #8497849 - Flags: feedback?(dmajor) → feedback+
>  - exceptionHandler is a ULONG, not a word; I'm not sure how v8 works with
> an Address

I suspect v8's version of these need to be 'unsigned char :4' or else there will be some padding before the next field. Maybe that hid the size difference?
	41 unsigned frame_register : 4;
	42 unsigned frame_offset : 4;

Hopefully Jochen is reading this; I don't have an account to comment on the other review.
(Reporter)

Comment 93

5 years ago
Posted patch hoist-allocSplinter Review
This prepares for the next patch, which does something interesting in the Win impl.
Attachment #8497930 - Flags: review?(jdemooij)
(Reporter)

Comment 94

5 years ago
Posted patch call-uef (obsolete) — Splinter Review
Perhaps for this review David can take all the Windows magiks and Jan can review for general Ion fit.
Attachment #8497931 - Flags: review?(jdemooij)
Attachment #8497931 - Flags: review?(dmajor)
Comment on attachment 8497930 [details] [diff] [review]
hoist-alloc

Review of attachment 8497930 [details] [diff] [review]:
-----------------------------------------------------------------

::: js/src/jit/ExecutableAllocatorPosix.cpp
@@ +53,5 @@
> +js::jit::DeallocateExecutableMemory(void *addr, size_t bytes, size_t pageSize)
> +{
> +    JS_ASSERT(bytes % pageSize == 0);
> +    mozilla::DebugOnly<int> result = munmap(addr, bytes);
> +    MOZ_ASSERT(!result || errno == ENOMEM);

Nit: this function uses both JS_ASSERT and MOZ_ASSERT; plan is to s/JS_ASSERT/MOZ_ASSERT soon so I'd use MOZ* everywhere.
Attachment #8497930 - Flags: review?(jdemooij) → review+
Comment on attachment 8497931 [details] [diff] [review]
call-uef

Review of attachment 8497931 [details] [diff] [review]:
-----------------------------------------------------------------

I only skimmed the Windows code. Glad this works!

::: js/src/jit/ExecutableAllocatorWin.cpp
@@ +221,5 @@
> +        addr = (uint8_t*)addr - pageSize;
> +        UnregisterExecutableMemory(addr, bytes, pageSize);
> +    }
> +#endif
> +    

Nit: some trailing whitespace

::: js/src/jsfriendapi.h
@@ +2662,4 @@
>  ExecuteInGlobalAndReturnScope(JSContext *cx, JS::HandleObject obj, JS::HandleScript script,
>                                JS::MutableHandleObject scope);
>  
> +#if defined(XP_WIN) && defined(_WIN64)

Nit: s/_WIN64/JS_CPU_X64 for consistency with the rest of the patch

@@ +2678,5 @@
> +// a handler with all JIT code that simply calls breakpad's unhandled exception
> +// filter (which will perform crash reporting and then terminate the process).
> +// This would be wrong if there was an outer __try block that expected to handle
> +// the fault, but we require this never happen.
> +// 

Nit: trailing whitespace
Attachment #8497931 - Flags: review?(jdemooij) → review+
(Reporter)

Comment 97

5 years ago
(In reply to Jan de Mooij [:jandem] from comment #96)
> Nit: s/_WIN64/JS_CPU_X64 for consistency with the rest of the patch

I specifically avoided that here since jsfriendapi.h is an installed header and I didn't think we had the JS_CPU_* #defines (I think I've hit this problem before).

Comment 98

5 years ago
I based my unwind_info on the version I found in VC/crt/src/ehdata.h.

Your version looks more sane. My current patch is here btw: https://codereview.chromium.org/619543002
(Reporter)

Comment 99

5 years ago
General try push with #ifdef _WIN64 to fix WinXP bustage:
https://tbpl.mozilla.org/?tree=Try&rev=84389936e4fd

If this all works, the one minor remaining thing I'd like to address in the "Tell JS about the UEF" patch is avoiding the function-pointer cast with the XXX comment.  Since this funcptr is now being called from our own C++, we can use whatever signature.
(Reporter)

Comment 100

5 years ago
(In reply to Jochen Eisinger from comment #98)
> Your version looks more sane. My current patch is here btw:
> https://codereview.chromium.org/619543002

breakpad_win.cc looks familiar ;)  Out of curiosity, where did you get the signature of CrashForExceptionInNonABICompliantCodeRange from?  I think it'll compile down to the same ABI, but I was curious about the different type names and if there is a more official documentation of this typedef than Wine.

More generally, if RegisterNonABICompliantCodeRange is put inside breakpad, it'd be nice for us to reuse that rather than duplicating the logic in SM.  I guess we'd need to wait for up/down-streaming, though.
(Reporter)

Comment 101

5 years ago
(In reply to Luke Wagner [:luke] from comment #100)
> Out of curiosity, where did you get the
> signature of CrashForExceptionInNonABICompliantCodeRange from?

Oh I see PEXCEPTION_ROUTINE in ehdata.h.  That whole header is slightly off, though (wrong field sizes of UNWIND_INFO, ExceptionHandler is in RUNTIME_FUNCTION, not UNWIND_INFO, etc); it's funny that Wine is a better resource.
Comment on attachment 8493973 [details] [diff] [review]
make breakpad reliably dump crash reports in the presence of the JIT

Review of attachment 8493973 [details] [diff] [review]:
-----------------------------------------------------------------

Looks like Luke is doing something better (thanks Luke!), no point in having this request open.
Attachment #8493973 - Flags: feedback?(dmajor)

Comment 103

5 years ago
there's also http://msdn.microsoft.com/en-us/library/b6sf5kbd.aspx

I'm not sure whether the RegisterNonABICompliantCodeRange method is a good fit for a library - it's a huge hack :-/
Comment on attachment 8497931 [details] [diff] [review]
call-uef

Review of attachment 8497931 [details] [diff] [review]:
-----------------------------------------------------------------

::: js/src/jit/ExecutableAllocatorWin.cpp
@@ +143,5 @@
> +    // sUnhandledExceptionFilter, we must generate a little thunk inside the
> +    // record. The record is put on its own page so that we can take away write
> +    // access to protect against accidental clobbering.
> +
> +    r->runtimeFunction.BeginAddress = 0;

Maybe we should start at pageSize?

@@ +159,5 @@
> +    // mov imm64, rax
> +    r->thunk[0]  = 0x48;
> +    r->thunk[1]  = 0xb8;
> +    void *handler = &ExceptionHandler;
> +    memcpy(&r->thunk[2], &handler, 8);

Perhaps sizeof(void*)?
Attachment #8497931 - Flags: review?(dmajor) → review+
(Reporter)

Comment 105

5 years ago
(In reply to :dmajor (away 3-7 October) from comment #104)
> Maybe we should start at pageSize?

That sounds fine.  Since the first page ends up non-executable, it shouldn't matter in practice.

> Perhaps sizeof(void*)?

I thought about that, but it seems like, since we're depending on the exact number of bytes (and not being sizeof-generic in this computation) it was better to be totally explicit.

Want me to take over your "Tell JS about the UEF" patch, or would you like to do that?
> Want me to take over your "Tell JS about the UEF" patch, or would you like
> to do that?

Feel free to take it. Ted is the best reviewer for that file. Btw, I added the TerminateProcess call pretty hastily. It should probably have at least a comment and maybe be a separate patch.
(Reporter)

Comment 107

5 years ago
Attachment #8497320 - Attachment is obsolete: true
Attachment #8498612 - Flags: review?(ted)
(Reporter)

Comment 108

5 years ago
This patch has the updated jsfriendapi.h signature, for reference.  Carrying forward r+.
Attachment #8497931 - Attachment is obsolete: true
Attachment #8498613 - Flags: review+
(Reporter)

Updated

5 years ago
Attachment #8497849 - Attachment is obsolete: true

Comment 109

5 years ago
fyi, I landed my change, and we're seeing crash reports from jitted code again.
(Reporter)

Comment 110

5 years ago
Fantastic, and thanks for all the friendly advice/updates.  If you do see any problems in the field in the future, we'd much appreciate hearing about them :)

Comment 111

5 years ago
(In reply to Nathan Froyd (:froydnj) from comment #68)
> I'm not sure what's going wrong here, though I would not be *completely*
> surprised if there were a few functions in Microsoft system libraries that
> didn't have unwind information in them, and the default
> return-address-at-rsp following works just fine.  (Even system implementers
> sometimes forget their own rules about their system.)
AFAIK, all MS JITs I know of do the unwind tables correctly, but I hasn't seen a non-MS JIT that does..
Comment on attachment 8498612 [details] [diff] [review]
TellJSAboutUEF.txt

Stealing this from Ted.
Attachment #8498612 - Flags: review?(ted) → review+
Excellent work, gentlemen. Thank you.
You need to log in before you can comment on or make changes to this bug.