Closed Bug 1542827 Opened 5 years ago Closed 5 years ago

Windows AArch64 breakpad exception handler not working

Categories

(Toolkit :: Crash Reporting, defect, P1)

defect

Tracking

()

VERIFIED FIXED
mozilla68
Tracking Status
firefox67 + verified
firefox68 + verified

People

(Reporter: overholt, Assigned: gsvelto)

References

(Blocks 1 open bug, )

Details

Attachments

(1 file)

I can reliable crash a tab (see bug 1542821) but no reports show up in about:crashes. This is on a Windows-on-ARM build using today's nightly.

Gabriele, Eric suggested I needinfo you. Is there anything I can get in my profile that would help diagnose the problem here?

Flags: needinfo?(gsvelto)

nbp noted that a report for his crash was sent when the browser itself crashed (so presumably not just a tab crash), but there still weren't any entries in about:crashes.

This is certainly a regression, when I started working on getting crash reporting to work the about:crashes pages worked as expected. I'll investigate tomorrow. Andrew if you launch it from a console do you see any JS error when opening about:crashes?

I don't see any errors in the Browser Console or the devtools console (and nothing in the Windows Command Prompt, either; should I be using some special flag?) when opening about:crashes.

On my local build I cannot produce minidumps at all which is rather odd. Considering that the tests are failing too I fear there must be some major breakage.

Flags: needinfo?(gsvelto)

Yep, no minidumps are being written upon a crash, we're back to square one :-|

From what I can tell the exception handler is not being called at all (which would also explain why all the tests were failing). I've poked it with windbg but my Windows debugging-fu is limited. Now I'm running a bisection to find where we regressed.

Assignee: nobody → gsvelto
Status: NEW → ASSIGNED

I've been able to narrow this down to between the 20th and the 23rd of March with the former nightly having a working crash reporter and the latter not having it. Unfortunately I haven't had much lack narrowing it down further because in my local builds the exception handler doesn't trigger even when building the same version as the 20th (which is rev 25398e555020). I've tried both an opt and a debugging build with no luck.

FWIW I have crashes in about:crashes from April 3rd.

(In reply to Andrew Overholt [:overholt] from comment #10)

https://crash-stats.mozilla.org/report/index/782f6b43-ea44-4383-8371-d43ee0190403, for example.

That's a hung content process. The minidump generation happens in response to an explicit request from the main process without the exception handler being invoked so that should work even when the exception handler doesn't. Can you get a crash dump if you use about:crashcontent?

(In reply to Gabriele Svelto [:gsvelto] from comment #11)

(In reply to Andrew Overholt [:overholt] from comment #10)

https://crash-stats.mozilla.org/report/index/782f6b43-ea44-4383-8371-d43ee0190403, for example.

That's a hung content process. The minidump generation happens in response to an explicit request from the main process without the exception handler being invoked so that should work even when the exception handler doesn't. Can you get a crash dump if you use about:crashcontent?

No.

What about https://crash-stats.mozilla.org/report/index/cf04356a-597e-4733-b137-58e330190402 that about:crashes says is from April 2?

(In reply to Andrew Overholt [:overholt] from comment #12)

No.

What about https://crash-stats.mozilla.org/report/index/cf04356a-597e-4733-b137-58e330190402 that about:crashes says is from April 2?

It looks like a valid crash so we're getting some... but not all of them apparently.

(In reply to Gabriele Svelto [:gsvelto] from comment #13)

(In reply to Andrew Overholt [:overholt] from comment #12)

No.

What about https://crash-stats.mozilla.org/report/index/cf04356a-597e-4733-b137-58e330190402 that about:crashes says is from April 2?

It looks like a valid crash so we're getting some... but not all of them apparently.

That's a jit crash, maybe implying our jit exception handling hooks are working?

I further narrowed it down to between rev fd1adb9941e9 (good) and rev 8941a9be9141 (bad). That's with shippable builds, with local builds the exception handler never seem to work which is really odd. I'm now changing my compiler to what was used for the last shippable build where it worked in the hope that minimizing the differences between my local build and the shippable one will lead me to the solution.

Priority: -- → P1

I may have struck go with my last test. Downgrading rust to 1.32.0 got the crashreporter working again on rev fd1adb9941e9. I'm now trying rev 8941a9be9141 too.

Yup, if build rev 8941a9be9141 with rust 1.32.0 the exception handler works fine, if I use the current nightly it doesn't. I'll have to bisect rustc tomorrow :(

I'm updating the title to reflect the issue here.

I tested building rev 8941a9be9141 with clang version 8.0.0 (tags/RELEASE_800/rc3 355016) and rust 1.32.0 and the exception handler works fine. Upgrading only clang to version 8.0.0 (tags/RELEASE_800/final 356365) or only rust to version 1.33.0 also works but if I build with both then it stops working. It seems to be some kind of interaction between the two. The JIT exception handler seems unaffected as per comment 14.

David, Mike any idea as to what might be causing this? I hoped I would be able to narrow it down by bisecting clang but with also rust in the mix it will take ages.

Flags: needinfo?(mh+mozilla)
Flags: needinfo?(dmajor)
Summary: Tab crashes not showing up in about:crashes → Windows AArch64 breakpad exception handler not working

Can you clarify what are your exact steps to test this? If we're talking specifically about the crash in comment 0, that may be a different story due to the nature of that failure. Or are crashes broken more generally?

Also, now that m-c has the Rust 1.34 release, does that change anything?

(In reply to David Major [:dmajor] (low availability) from comment #19)

Can you clarify what are your exact steps to test this? If we're talking specifically about the crash in comment 0, that may be a different story due to the nature of that failure. Or are crashes broken more generally?

JIT crashes work, regular ones (e.g. MOZ_CRASH()) don't. I've tried both content and main process crashes with about:crashcontent and about:crashparent

Also, now that m-c has the Rust 1.34 release, does that change anything?

No, I tried all the way to the current nightly. In short:

clang 8.0.0rc3 + rust 1.32 => works
clang 8.0.0 + rust 1.32 => works
clang 8.0.0rc3 + rust 1.33+ => works
clang 8.0.0 + rust 1.33+ => doesn't work

:froydnj pointed to these two recent LLVM commits affecting AArch64, the second one looks suspicious in this context:

https://github.com/llvm/llvm-project/commit/502c6557aa1aa6917fe5ce0ee26cfe936d45eb98
https://github.com/llvm/llvm-project/commit/d1786389da278381d02324739dd1bd5825c6f860

I tested about:crashparent on nightly 04-09, with e10s disabled to make debugging easier.

It looks like we can't unwind the frame for xul!Interpret. WinDbg's k command stops there, and so does the unwind in ntdll!RtlDispatchException. On an x86_64 build, I can get much farther, well into the JIT frames, which is sufficient to reach Breakpad via our handler.

Would you mind testing a try build with that second LLVM commit reverted? I've got an awkward work setup today and can't get a new build onto my arm machine.

Flags: needinfo?(dmajor)

I haven't a local LLVM build handy but I could get one working on Monday. BTW how do you set breakpoints to hit ntdll!RtlDispatchException? I've tried by disabling first chance exception handling and breaking on ntdll!RtlDispatchException but somehow I never seem to get there in windbg.

(In reply to Gabriele Svelto [:gsvelto] from comment #22)

BTW how do you set breakpoints to hit ntdll!RtlDispatchException? I've tried
by disabling first chance exception handling and breaking on
ntdll!RtlDispatchException but somehow I never seem to get there in windbg.

Since breakpoints are implemented with exceptions, it's kind of a chicken and egg problem, so I do this horrible thing where I substitute an infinite loop. There might be a smarter way to do it but this gets the job done:

> r @$t0=poi(ntdll!RtlDispatchException) ; save the first instructions in a temp register
> ed ntdll!RtlDispatchException 14000000 ; infinite loop
> g ; Repro the crash, then break in
> !runaway ; Find which thread is spinning
> ~4s ; Switch to that thread
> eq ntdll!RtlDispatchException @$t0 ; restore the first instruction

(In reply to David Major [:dmajor] (low availability) from comment #23)

Since breakpoints are implemented with exceptions, it's kind of a chicken and egg problem, so I do this horrible thing where I substitute an infinite loop. There might be a smarter way to do it but this gets the job done:

> r @$t0=poi(ntdll!RtlDispatchException) ; save the first instructions in a temp register
> ed ntdll!RtlDispatchException 14000000 ; infinite loop
> g ; Repro the crash, then break in
> !runaway ; Find which thread is spinning
> ~4s ; Switch to that thread
> eq ntdll!RtlDispatchException @$t0 ; restore the first instruction

Thanks, that's neat. I tried using sxd bpe to try and let the exception go to breakpad before going to the debugger but it doesn't seem to work. windbg documentation distinguishes between break and handle status for the different exception types but I haven't figured out the right combination of options (provided one exists).

I'm now building clang locally to analyze the changes in comment 20 and if that doesn't work I'll just bisect it (\me sighs deeply).

Flags: needinfo?(mh+mozilla)

Note you don't necessarily need to build llvm locally. You can tweak the llvm versions in build/build-clang/*.json and push to try.

(In reply to Mike Hommey [:glandium] from comment #25)

Note you don't necessarily need to build llvm locally. You can tweak the llvm versions in build/build-clang/*.json and push to try.

Thanks for the tip. I'm almost done with the bisection on my local machine but this would have been very helpful (especially because I spent quite a bit of time just setting up the build).

Contrary to my original assessment it turns out that it's change 502c6557aa1aa6917fe5ce0ee26cfe936d45eb98 that's breaking unwinding. David's comment should have pointed me to it since unwinding gets stuck in the interpreter [1] which does indeed use a jump-table. The problems is still present in upstream LLVM so to unblock us I'll add a local patch to clang reverting the change. I'll also file a bug with upstream to fix this.

[1] https://searchfox.org/mozilla-central/rev/1b2636e8517aa48422ed516affe4d28cb7fa220a/js/src/vm/Interpreter.cpp#1653

I can generate crash reports correctly with the build from comment 28, time for review.

Have you filed a bug upstream for this?

(In reply to David Major [:dmajor] (low availability) from comment #31)

Have you filed a bug upstream for this?

Not yet, I'm trying to prepare a reduced test-case before I do.

I think you might as well file it and CC the patch author and reviewers. There's a chance they may be able to "just see it" and save you the trouble. If not, nothing lost, and they may be able to make suggestions about the way to reduce it that would help them the most.

(In reply to David Major [:dmajor] (low availability) from comment #33)

I think you might as well file it and CC the patch author and reviewers. There's a chance they may be able to "just see it" and save you the trouble. If not, nothing lost, and they may be able to make suggestions about the way to reduce it that would help them the most.

Thanks, I'll do that. I've looked at the generated assembly and the only difference is the executable section where the jump table data is stored. Since I don't know how that could affect structured exception handling on Windows I have no clue why it's not working; hopefully they'll know more about it.

Mike, can you review this today?

Flags: needinfo?(mh+mozilla)

[Tracking Requested - why for this release]: It would be nice if our crashreporter worked in arm64 windows on 67, since that's the first release we're officially doing with arm64 windows support. The fix is small and arm64-specific.

See Also: → 1544360
Pushed by gsvelto@mozilla.com:
https://hg.mozilla.org/integration/autoland/rev/40f546f921f6
Revert a change to LLVM to fix unwinding to the exception handler on Windows/AArch64 r=glandium

I filed a bug upstream for this. I couldn't CC the patch author and reviewers because their email addresses weren't visible in the commit message. I looked for a bug related to that change but couldn't find it.

Flags: needinfo?(mh+mozilla)
Status: ASSIGNED → RESOLVED
Closed: 5 years ago
Resolution: --- → FIXED
Target Milestone: --- → mozilla68
Blocks: 1536221

Gabriele, is that a patch we should uplift to beta or can it just ride the 68 train?

Flags: needinfo?(gsvelto)

(In reply to Pascal Chevrel:pascalc from comment #40)

Gabriele, is that a patch we should uplift to beta or can it just ride the 68 train?

We want this in beta, because that's affected too and we're shipping it on AArch64/Win IIRC. Sorry, I forgot!

Flags: needinfo?(gsvelto)

There is a better fix pending upstream at https://reviews.llvm.org/D61095. I think it's too late to rush it into beta, but as a followup we should replace our patch with upstream's when it lands.

Comment on attachment 9058745 [details]
Bug 1542827 - Revert a change to LLVM to fix unwinding to the exception handler on Windows/AArch64

Beta/Release Uplift Approval Request

  • User impact if declined: Users will not be able to send crash reports
  • Is this code covered by automated tests?: No
  • Has the fix been verified in Nightly?: Yes
  • Needs manual test from QE?: Yes
  • If yes, steps to reproduce: 1. Launch Firefox
  1. Navigate to about:crashcontent
  2. The "tab crashed" page should be shown complete with the form to submit the crash
  3. Navigate to about:crashparent
  4. Firefox should crash and the crash reporter dialog should appear
  • List of other uplifts needed: None
  • Risk to taking this patch: Low
  • Why is the change risky/not risky? (and alternatives if risky): The change removes the offending patch from the toolchain, no other code is affected. This functionality is covered by automated tests but they're still not working (for unrelated reasons).
  • String changes made/needed: None
Attachment #9058745 - Flags: approval-mozilla-beta?
Flags: qe-verify+

Comment on attachment 9058745 [details]
Bug 1542827 - Revert a change to LLVM to fix unwinding to the exception handler on Windows/AArch64

Uplift accepted for 67 beta 16, thanks.

Attachment #9058745 - Flags: approval-mozilla-beta? → approval-mozilla-beta+

Reproduced the issue on affected Nightly 2019-04-14 on Lenovo Yoga laptop.

Verified - fixed on the latest Nightly 68.0a1 (2019-05-01) (64-bit) and latest Beta 67.0b16 (64-bit).
The crash reporter is displayed and the submitted crash report can be seen listed in about:crashes.

Status: RESOLVED → VERIFIED
Flags: qe-verify+
You need to log in before you can comment on or make changes to this bug.

Attachment

General

Created:
Updated:
Size: