Closed Bug 1302078 Opened 3 years ago Closed 3 years ago

Rust panic handling broken on Win64

Categories

(Core :: General, defect)

x86_64
Windows
defect
Not set

Tracking

()

RESOLVED FIXED
mozilla56
Tracking Status
firefox51 --- wontfix
firefox52 --- wontfix
firefox53 --- wontfix
firefox54 --- wontfix
firefox55 --- wontfix
firefox56 --- fixed

People

(Reporter: ted, Assigned: ted)

References

(Blocks 1 open bug)

Details

(Whiteboard: [stylo])

Attachments

(1 file)

Using the 2016-09-08 Win64 Nightly build, which has my patch for bug 1300152 before it was backed out, and triggering a Rust panic using the API there by evaluating (in the browser console): `Components.classes["@mozilla.org/xpcom/debug;1"].getService(Components.interfaces.nsIDebug2).rustPanic("OH NO")`, I get a crash that isn't handled by Breakpad. It looks like this in WinDBG:

(1738.6b4): Illegal instruction - code c000001d (first chance)
(1738.6b4): Access violation - code c0000005 (first chance)
First chance exceptions are reported before any exception handling.
This exception may be expected and handled.
ntdll!RtlImageNtHeaderEx+0x343:
00007ffa`c1d891d3 498b00          mov     rax,qword ptr [r8] ds:00000000`00000000=????????????????
0:000> kp
Child-SP          RetAddr           Call Site
000000de`69ff2a80 00007ffa`c1dea10a ntdll!RtlImageNtHeaderEx+0x343
000000de`69ff3180 00007ffa`7fffdc2d ntdll!KiUserExceptionDispatcher+0x3a
000000de`69ff3880 00007ffa`7ffff56a xul!mp4parse_get_track_video_info+0x14fd
000000de`69ff3e10 00007ffa`7ffff150 xul!mp4parse_get_track_video_info+0x2e3a
000000de`69ff3e60 00007ffa`7ffed1b7 xul!mp4parse_get_track_video_info+0x2a20
000000de`69ff3ef0 00004ff4`67180b00 xul!intentional_panic+0x1627 [c:\builds\moz2_slave\m-cen-w64-ntly-000000000000000\build\src\obj-firefox\toolkit\library\rust\<std macros> @ 8]
000000de`69ff3ef8 00007ffa`8009e77b 0x00004ff4`67180b00
000000de`69ff3f00 00000000`00000000 xul!nsComponentManagerImpl::GetService(struct nsID * aClass = 0x00000000`00000001, struct nsID * aIID = 0x00000000`00000000, void ** aResult = 0x000001fe`f2890601)+0xb3 [c:\builds\moz2_slave\m-cen-w64-ntly-000000000000000\build\src\xpcom\components\nscomponentmanager.cpp @ 1357]

I'm not sure what's going on here yet.
Does Gecko currently register a panic hook with the Rust code it links in? Also, does Gecko compile with panic=abort? Technically something like:

    extern fn foo() {
        panic!();
    }

is undefined behavior. All our `extern` functions assume they will not be unwound through. If, however, you've installed a panic hook to abort unwinding (or you're compiling with panic=abort) then this'll be fine as it'll just abort the process.
Yes, Gecko compiles with panic=abort:

http://dxr.mozilla.org/mozilla-central/source/toolkit/library/rust/Cargo.toml#37

Confirmed with looking at the logs from mozilla-central.  (Ted's patches didn't change anything in that respect.)
Ok, thanks! In that case this it looks like (I think) from the Rust side everything is pretty standard. On Windows an implementation of panic when code is compiled with panic=abort is just an illegal instruction -- https://github.com/rust-lang/rust/blob/85592fbe60ee4e6878bb1f11da0243c3a3a440f3/src/libpanic_abort/lib.rs#L64-L67.

The message here seems to indicate that

> (1738.6b4): Illegal instruction - code c000001d (first chance)

So maybe that's not getting translated to a crash in Rust?
My first guess, just looking at the stack, is that something in Windows is looking for an exception handler and crashing in that process. Perhaps there's something about the generated Rust code that it doesn't like? Win64 uses a set of lookup tables for exception handling. It's allowable for leaf functions to not have entries there, but I wonder if we're doing something in a way that Windows does not expect.
Just for clarification, those three frames in the stack that are "xul!mp4parse_get_track_video_info+xxx" are inlined bits of the Rust standard library where we lack debug info.

If I build a standalone Rust project that uses panic=abort, and run that under the debugger, the debugger breaks on the "ud2" instruction like I would expect. It must be something about linking Rust code into another binary, or calling Rust code from C that exhibits this behavior.
Here's my test program:
https://github.com/luser/test-panic

It's building a Rust static library with basically the same code I tried landing into Gecko, then linking that into a binary and calling that function from C. Running it under a debugger breaks on the ud2 instruction as expected. I tried fiddling build options and also tried making it call SetUnhandledExceptionFilter / AddVectoredExceptionHandler prior to the panic (Gecko does both of those) to see if it made any difference, but I couldn't get it to exhibit the behavior I see in Gecko.
Assignee: nobody → dmajor
This is what I see with the 2016-09-08 nightly. The names are a little different on my machine, but it's the same idea.

(1384.135c): Access violation - code c0000005 (first chance)
First chance exceptions are reported before any exception handling.
This exception may be expected and handled.
ntdll!RtlDispatchException+0x246:
00007ffb`074a85a6 498b00          mov     rax,qword ptr [r8] ds:00000000`00000000=????????????????
0:000> kL
 # Child-SP          RetAddr           Call Site
00 000000a4`36bf41c0 00007ffb`07508a3a ntdll!RtlDispatchException+0x246
01 000000a4`36bf48c0 00007ffa`e1b1dc2d ntdll!KiUserExceptionDispatch+0x3a
02 000000a4`36bf4fd0 00007ffa`e1b1f56a xul!from+0x145d
03 000000a4`36bf5560 00007ffa`e1b1f150 xul!from+0x2d9a
04 000000a4`36bf55b0 00007ffa`e1b0d1b7 xul!from+0x2980
05 000000a4`36bf5640 0000db53`4d470a00 xul!intentional_panic+0x1627
06 000000a4`36bf5648 00007ffa`e1bbe77b 0x0000db53`4d470a00
07 000000a4`36bf5650 00000000`00000000 xul!nsComponentManagerImpl::GetService+0xb3

When we execute the ud2 at "xul!from+0x145d", the exception machinery kicks in, and ntdll!RtlDispatchException does an unwind by calling ntdll!RtlLookupFunctionEntry on each frame. The first three lookups return reasonable-looking results. The lookup for intentional_panic returns nullptr. From there, it seems like the unwinder tries to recover by guessing, but it ends up walking bogus frames and eventually crashing on one of them.

For some reason, Ted's sample program does have unwind data for intentional_panic. RtlDispatchException  is able to get all the way down the stack to the unhandled exception filter.

(In reply to Alex Crichton [:acrichto] from comment #1)
> All our `extern` functions assume they will not be
> unwound through. If, however, you've installed a panic hook to abort
> unwinding (or you're compiling with panic=abort) then this'll be fine as
> it'll just abort the process.

It seems that aborting the process still requires unwind data if we want to walk back down to Breakpad.
(In reply to :dmajor from comment #7)
> The lookup for intentional_panic returns
> nullptr. From there, it seems like the unwinder tries to recover by
> guessing, but it ends up walking bogus frames and eventually crashing on one of them.

A minor clarification: when unwind info isn't found, the unwinder assumes it's in a leaf function, and it looks for the next frame through the stack pointer. So it's not really an attempt to "recover" from bad data.
Ted, we plan to make Win64 Firefox the default architecture for new installs in Firefox ~53. Should this Rust panic bug block that switch? If this bug is "just" about losing Breakpad crash reports for Rust panics, it doesn't seem like a blocker.
Flags: needinfo?(ted)
OS: Unspecified → Windows
Hardware: Unspecified → x86_64
(In reply to Chris Peterson [:cpeterson] from comment #9)
> Ted, we plan to make Win64 Firefox the default architecture for new installs
> in Firefox ~53. Should this Rust panic bug block that switch? If this bug is
> "just" about losing Breakpad crash reports for Rust panics, it doesn't seem
> like a blocker.

I'm not totally clear on whether this would break non-intentionally-panicing Rust code, that could use some more testing. If it does then that would mean that when we hit a Rust panic on Win64 we'd get the Windows Error Reporting dialog instead of the Mozilla crash reporter, and we'd lose out on crash reports for those situations. Obviously given the small amount of Rust code we have in-tree right now it's not a huge problem, but we should make sure this is sorted out before we get much farther.
Flags: needinfo?(ted)
David, is this panic unwinding bug a problem with Gecko or Rust? Who should look at this problem next?
Flags: needinfo?(dmajor)
I think Rust would need to drop its assumption that extern functions will not be unwound through. (Or at least have some attribute to allow it selectively.)
Flags: needinfo?(dmajor)
(In reply to David Major [:dmajor] from comment #12)
> I think Rust would need to drop its assumption that extern functions will
> not be unwound through. (Or at least have some attribute to allow it
> selectively.)

Alex, how do you recommend we handle this issue? Is it safe for Rust to assume extern functions can unwind?
Flags: needinfo?(acrichton)
Oh sorry I missed some of the messages here! Let me try to respond to a few all at once.

> The lookup for intentional_panic returns nullptr. From there, it seems
> like the unwinder tries to recover by guessing, but it ends up walking
> bogus frames and eventually crashing on one of them.

This sounds like the crux of the problem. It sounds like on an illegal instruction the kernel will *always* unwind the stack? Then in Ted's example program some unwind info is there, so it unwinds to the point where there's no handler, and then aborts. But in Firefox the unwind info for the Rust code is missing, the unwinder can't go all the way and hard aborts?

(just clarifying the current situation)

> I think Rust would need to drop its assumption that extern
> functions will not be unwound through

Unfortunately this is likely to be pretty difficult from our end for a number of reasons. It'd be best if we could figure out another solution before resorting to this at least!

One idea could be to test out panic=unwind instead of just panic=abort. If that solves the issues on Windows it's at least a starting point! My next idea would be to dig into why a normal program is working yet code in Firefox isn't. There may be some random codegen or linker option we need to pass and aren't, maybe?
Flags: needinfo?(acrichton)
I tried fiddling a bunch of compile options in my test program and I could not get it to reproduce. The only thing that I haven't tested that I know Firefox does differently is using RtlAddFunctionTable to add unwind entries for JIT frames:
https://msdn.microsoft.com/en-us/library/windows/desktop/ms680588(v=vs.85).aspx

https://dxr.mozilla.org/mozilla-central/rev/28e2a6dde76ab6ad4464a3662df1bd57af04398a/js/src/jit/ExecutableAllocatorWin.cpp#178
> This sounds like the crux of the problem. It sounds like on an illegal
> instruction the kernel will *always* unwind the stack? Then in Ted's example
> program some unwind info is there, so it unwinds to the point where there's
> no handler, and then aborts. But in Firefox the unwind info for the Rust
> code is missing, the unwinder can't go all the way and hard aborts?
> 
> (just clarifying the current situation)

That's right. In order to use the crash reporter, the NT exception machinery needs to be able to unwind all the way, since the Breakpad handler is at the bottom.

> One idea could be to test out panic=unwind instead of just panic=abort. If
> that solves the issues on Windows it's at least a starting point! My next
> idea would be to dig into why a normal program is working yet code in
> Firefox isn't. There may be some random codegen or linker option we need to
> pass and aren't, maybe?

Does panic=unwind run destructors, etc.? If so, I don't think we'd want that. We want to abort right away without cleanup, but we need the Breakpad handler to be able to catch and report that abort.

If there is some compiler state that controls the unwind info, that would be most useful, but I worry it may be very difficult pin down given the size of Gecko.
When compiling with panic=unwind it should *only* run destructors for Rust-originating exceptions. We "throw an exception" with special codes and such indicating it originated from Rust, and our destructors should ignore all other exceptions (like segfaults). So in theory an illegal instruction or another Gecko-generated fault won't run Rust destructors.

AFAIK we don't do anything "special" here on panic=abort. Put another way, Rust should look exactly like C. Speaking of which, what happens if there's some C code that triggers the illegal instruction? Does it exhibit the same behavior as Rust or does it unwind correctly? If it works correctly then there's definitely some difference we need to fix up!
Without having tested it, my expectation is that if C code does a ud2, we'd still have enough SEH data to unwind back to Breakpad.
Yeah that's what I'd expect as well. If that's the case, I assume that clang is supported, right? In that case it'd be useful to analyze the C++ -> C function call, specifically the LLVM IR to see what's different about the C++ -> C call from the C++ -> Rust call.
Alex, do you remember where we left off on this?
Assignee: dmajor → nobody
Flags: needinfo?(acrichton)
IIRC it had something to do with the Rust compiler not emitting .pdata for some functions. I think it was something along the lines of normal functions (e.g. fn foo() {}) have .pdata but extern functions don't (e.g. extern fn foo() {}).

I don't think we successfully tracked down why .pdata was missing, though.
Flags: needinfo?(acrichton)
>When compiling with panic=unwind it should *only* run destructors for Rust-originating exceptions.

I'd like to dispute this claim. In my own personal testing with 64bit MSVC, panics fired by Rust that unwound into C++ code would trigger C++ destructors, and exceptions thrown by C++ that unwound into Rust code would trigger Rust drops.
No longer blocks: win64-rollout
Whiteboard: [stylo]
Btw dmajor if it helps there's some logs of a convo we had awhile back which may have some info: https://gist.github.com/alexcrichton/9d56840757d55e9af86c92fa8926a3d2

So it sounded like it boiled down to missing pdata, but this seems curious to me when I finally got around to trying to reproduce it locally! So locally I have:


$ rustc -vV
rustc 1.15.1 (021bd294c 2017-02-08)
binary: rustc
commit-hash: 021bd294c039bd54aa5c4aa85bcdffb0d24bc892
commit-date: 2017-02-08
host: x86_64-pc-windows-msvc
release: 1.15.1
LLVM version: 3.9


And then if I compile a Rust file that looks like

#[no_mangle]
pub fn foo() {}

#[no_mangle]
pub extern fn bar() {}


with


$ rustc --crate-type dylib foo.rs -C prefer-dynamic


and then I run


$ dumpbin -pdata .\foo.dll
Microsoft (R) COFF/PE Dumper Version 14.00.24215.1
Copyright (C) Microsoft Corporation.  All rights reserved.


Dump of file .\foo.dll

File Type: DLL

Function Table (35)

           Begin    End      Info      Function Name

  00000000 00001000 00001003 00002508  foo
  0000000C 00001010 00001013 00002510  bar
  00000018 00001020 00001070 0000257C
  00000024 00001070 0000119B 00002518
  00000030 0000119C 0000121E 0000254C
  0000003C 00001220 00001315 00002598
  00000048 00001318 0000136C 00002584
  00000054 0000136C 000013A9 000025C8  _DllMainCRTStartup
  00000060 000013AC 000013E5 00002610  __scrt_acquire_startup_lock
  0000006C 000013E8 0000141C 00002658  __scrt_dllmain_after_initialize_c
  00000078 0000141C 00001431 00002650  __scrt_dllmain_before_initialize_c
  00000084 00001434 0000145C 00002670  __scrt_dllmain_crt_thread_attach
  00000090 0000145C 00001471 00002678  __scrt_dllmain_crt_thread_detach
  0000009C 00001474 000014D5 0000263C  __scrt_dllmain_exception_filter
  000000A8 000014D8 00001508 00002660  __scrt_dllmain_uninitialize_c
  000000B4 00001508 0000151C 00002668  __scrt_dllmain_uninitialize_critical
  000000C0 0000151C 00001565 00002620  __scrt_initialize_crt
  000000CC 00001568 00001631 00002630  __scrt_initialize_onexit_tables
  000000D8 00001634 000016CD 000025E8  __scrt_is_nonwritable_in_current_image
  000000E4 000016D0 000016F4 00002618  __scrt_release_startup_lock
  000000F0 000016F4 0000171F 00002628  __scrt_uninitialize_crt
  000000FC 00001720 0000176F 000025E0  _onexit
  00000108 00001770 00001787 000025D8  atexit
  00000114 00001788 00001834 00002680  __security_init_cookie
  00000120 00001834 00001857 0000268C  DllMain
  0000012C 00001884 0000189F 00002694  __scrt_initialize_default_local_stdio_options
  00000138 000018A8 000019ED 0000269C  __scrt_fastfail
  00000144 000019F0 00001A3A 000026AC  _RTC_Initialize
  00000150 00001A3C 00001A86 000026BC  _RTC_Terminate
  0000015C 00001A90 00001C56 000026CC  __isa_available_init
  00000168 00001CF0 00001CF2 000026E0  _guard_dispatch_icall_nop
  00000174 00001CF2 00001D09 00002544
  00000180 00001D09 00001D25 00002574
  0000018C 00001D25 00001D5B 000025C0
  00000198 00001D5B 00001D73 00002608

  Summary

        1000 .data
        1000 .gfids
        1000 .pdata
        1000 .rdata
        1000 .reloc
        1000 .rustc
        1000 .text



which looks like both `foo` and `bar` have pdata? Or am I maybe misinterpreting this?
Ok, further investigation reveals:

     # has pdata for 'bar'
     rustc foo.rs --crate-type cdylib -O -C panic=unwind

     # does not have pdata for 'bar'
     rustc foo.rs --crate-type cdylib -O -C panic=abort

The difference in IR reveals:

$ diff -u good.ll bad.ll                                                                                                                                                                                            --- good.ll     2017-03-14 12:10:20.938611700 -0700
+++ bad.ll      2017-03-14 12:10:12.128712900 -0700
@@ -3,16 +3,16 @@
 target datalayout = "e-m:w-i64:64-f80:128-n8:16:32:64-S128"
 target triple = "x86_64-pc-windows-msvc"

-; Function Attrs: norecurse nounwind readnone uwtable
+; Function Attrs: norecurse nounwind readnone
 define void @foo() unnamed_addr #0 {
 entry-block:
   ret void
 }

-; Function Attrs: norecurse nounwind readnone uwtable
+; Function Attrs: norecurse nounwind readnone
 define void @bar() unnamed_addr #0 {
 entry-block:
   ret void
 }

-attributes #0 = { norecurse nounwind readnone uwtable }
+attributes #0 = { norecurse nounwind readnone }


which makes me think that my previous findings (https://gist.github.com/alexcrichton/9d56840757d55e9af86c92fa8926a3d2#file-dmajor-2016-11-18-log-L43) were mistaken (or I forgot the panic=abort flag).


So that would lead me to the conclusion that we are not applying the `uwtable` attribute in LLVM to generate uwtables for these functions which would appear to be the case (https://github.com/rust-lang/rust/blob/master/src/librustc_trans/base.rs#L610-L612) when compiling with `-C panic=abort`. This leads me to the conclusion:

* Can you confirm that compiling *without* `-C panic=abort` fixes the issue?
* We can then land a patch in upstream rustc to do this on windows regardless of `-C panic=abort`
> * Can you confirm that compiling *without* `-C panic=abort` fixes the issue?

I'll look into this later today.
Flags: needinfo?(dmajor)
> > * Can you confirm that compiling *without* `-C panic=abort` fixes the issue?

Confirmed. I changed to panic=unwind and now we're getting .pdata:

$ dumpbin -pdata xul.dll | grep intentional_panic
  00000000 00040330 000403C6 03169A38  intentional_panic

I stuck a `ud2` inside the panic code (to restore the `panic=abort` behavior) then called Ted's test function, and it successfully popped up Breakpad!
Flags: needinfo?(dmajor)
It would still be nice to have a non-Firefox testcase to reproduce this issue. Per comment 15, my best guess is that it would involve using RtlAddFunctionTable.
Excellent, thanks for confirming David!

I've opened a PR to make the appropriate change to rustc: https://github.com/rust-lang/rust/pull/40549

Unfortunately I didn't have an idea of much of a test to add :(
(In reply to Alex Crichton [:acrichto] from comment #28)
> Unfortunately I didn't have an idea of much of a test to add :(

Although a big-picture test would be nice, maybe we could just verify that the .ll has `uwtable`, or that a simple binary has the right pdata?
I've attempted to add a test that just verifies uwtable is emitted
The PR against rust-lang/rust has now landed, so the next nightly (2017-03-24) should have the fix as well as the 1.18.0 release on 2017-06-08
Thanks, Alex.

Rust 1.18.0 release on 2017-06-08 means we should be able to ship this fix in Firefox 57, which will ride from Nightly to Aurora on 2017-08-07.
Blocks: 1348896
Summary: Rust panic handling broken on Win64? → Rust panic handling broken on Win64
Interestingly, I landed a test with bug 1300152, and I marked it as a known failure on Win64, but in practice it passes sometimes: bug 1352647. All the instances in the OrangeFactor log are from PGO builds, but I don't know if that's coincidence.
Will PGO affect Rust object code or just MSVC's?
(In reply to Chris Peterson [:cpeterson] from comment #34)
> Will PGO affect Rust object code or just MSVC's?

Rust is currently incapable of emitting MSVC LTCG bytecode, therefore making it unable to take advantage of MSVC LTCG, including PGO. If you use LTCG or PGO, it will only apply to C/C++ compiled with /GL. Also, you can never use kind=static to link to C/C++ static libraries which were compiled with /GL because it *will* break things.
Depends on: 1365300
Ted, Ralph has upgraded the Win64 builds of Nightly 55 to Rust 1.18.0-beta.4 (bug 1369115), which includes Alex's uwtable fix. Can you please confirm that Win64 panics are working in Nightly 55 and whether the rustPanic test you marked as a known failure on Win64 (in bug 1300152) is now expected to pass?
Flags: needinfo?(ted)
Using nsIDebug2.rustPanic from the browser console like so:
 Cc["@mozilla.org/xpcom/debug;1"].getService(Ci.nsIDebug2).rustPanic("Crashme!");

I produced a crash report on my Win64 nightly (56.0a1 (2017-06-13):
https://crash-stats.mozilla.com/report/index/305c53fc-89bd-4a88-8684-07ec10170613

The signature is a little weird, we should sort out what's happening there.
Flags: needinfo?(ted)
I think it's just identical code folding:

0:038> ln xul!alloc::oom::default_oom_handler
...
Exact matches:
    xul!std::panicking::rust_panic (void)
    xul!alloc::oom::default_oom_handler (void)
Ted, is this a Gecko issue you can investigate or do we need a Rust person to dig deeper?
Flags: needinfo?(ted)
Per comment 38 it's not a bug, just a quirk of the MSVC linker that we deal with in other places.
Flags: needinfo?(ted)
The try push is a patch to re-enable the test on Win64.
Assignee: nobody → ted
(In reply to Ted Mielczarek [:ted.mielczarek] from comment #41)
> https://treeherder.mozilla.org/#/
> jobs?repo=try&revision=29bae58063bfee6431a903e26ccfd6405b3ce3b6

test_crash_rust_panic.js passed in all the Win64 xpcshell runs I triggered. (There were a pair of known intermittent failures in other tests.)
Comment on attachment 8878429 [details]
bug 1302078 - re-enable test_crash_rust_panic.js on win64.

https://reviewboard.mozilla.org/r/149780/#review154494
Attachment #8878429 - Flags: review?(dmajor) → review+
Pushed by tmielczarek@mozilla.com:
https://hg.mozilla.org/integration/autoland/rev/b9d6b100ffd9
re-enable test_crash_rust_panic.js on win64. r=dmajor
https://hg.mozilla.org/mozilla-central/rev/b9d6b100ffd9
Status: NEW → RESOLVED
Closed: 3 years ago
Resolution: --- → FIXED
Target Milestone: --- → mozilla56
It sounds like this is not required for 55, so setting status to wontfix.  If I'm wrong please undo and request uplift to beta.
The thing that actually fixed this was bug 1369115, the patch here simply re-enabled a test in light of that.
You need to log in before you can comment on or make changes to this bug.