Windows AArch64 breakpad exception handler not working
Categories
(Toolkit :: Crash Reporting, defect, P1)
Tracking
()
People
(Reporter: overholt, Assigned: gsvelto)
References
(Blocks 1 open bug, )
Details
Attachments
(1 file)
47 bytes,
text/x-phabricator-request
|
pascalc
:
approval-mozilla-beta+
|
Details | Review |
I can reliable crash a tab (see bug 1542821) but no reports show up in about:crashes. This is on a Windows-on-ARM build using today's nightly.
Reporter | ||
Comment 1•6 years ago
|
||
Gabriele, Eric suggested I needinfo you. Is there anything I can get in my profile that would help diagnose the problem here?
Comment 2•6 years ago
|
||
nbp noted that a report for his crash was sent when the browser itself crashed (so presumably not just a tab crash), but there still weren't any entries in about:crashes.
Assignee | ||
Comment 3•6 years ago
|
||
This is certainly a regression, when I started working on getting crash reporting to work the about:crashes pages worked as expected. I'll investigate tomorrow. Andrew if you launch it from a console do you see any JS error when opening about:crashes?
Reporter | ||
Comment 4•6 years ago
|
||
I don't see any errors in the Browser Console or the devtools console (and nothing in the Windows Command Prompt, either; should I be using some special flag?) when opening about:crashes.
Assignee | ||
Comment 5•6 years ago
|
||
On my local build I cannot produce minidumps at all which is rather odd. Considering that the tests are failing too I fear there must be some major breakage.
Assignee | ||
Comment 6•6 years ago
|
||
Yep, no minidumps are being written upon a crash, we're back to square one :-|
Assignee | ||
Comment 7•6 years ago
|
||
From what I can tell the exception handler is not being called at all (which would also explain why all the tests were failing). I've poked it with windbg but my Windows debugging-fu is limited. Now I'm running a bisection to find where we regressed.
Assignee | ||
Updated•6 years ago
|
Assignee | ||
Comment 8•6 years ago
|
||
I've been able to narrow this down to between the 20th and the 23rd of March with the former nightly having a working crash reporter and the latter not having it. Unfortunately I haven't had much lack narrowing it down further because in my local builds the exception handler doesn't trigger even when building the same version as the 20th (which is rev 25398e555020). I've tried both an opt and a debugging build with no luck.
Reporter | ||
Comment 9•6 years ago
|
||
FWIW I have crashes in about:crashes from April 3rd.
Reporter | ||
Comment 10•6 years ago
|
||
Assignee | ||
Comment 11•6 years ago
|
||
(In reply to Andrew Overholt [:overholt] from comment #10)
https://crash-stats.mozilla.org/report/index/782f6b43-ea44-4383-8371-d43ee0190403, for example.
That's a hung content process. The minidump generation happens in response to an explicit request from the main process without the exception handler being invoked so that should work even when the exception handler doesn't. Can you get a crash dump if you use about:crashcontent?
Reporter | ||
Comment 12•6 years ago
|
||
(In reply to Gabriele Svelto [:gsvelto] from comment #11)
(In reply to Andrew Overholt [:overholt] from comment #10)
https://crash-stats.mozilla.org/report/index/782f6b43-ea44-4383-8371-d43ee0190403, for example.
That's a hung content process. The minidump generation happens in response to an explicit request from the main process without the exception handler being invoked so that should work even when the exception handler doesn't. Can you get a crash dump if you use about:crashcontent?
No.
What about https://crash-stats.mozilla.org/report/index/cf04356a-597e-4733-b137-58e330190402 that about:crashes says is from April 2?
Assignee | ||
Comment 13•6 years ago
|
||
(In reply to Andrew Overholt [:overholt] from comment #12)
No.
What about https://crash-stats.mozilla.org/report/index/cf04356a-597e-4733-b137-58e330190402 that about:crashes says is from April 2?
It looks like a valid crash so we're getting some... but not all of them apparently.
Comment 14•6 years ago
|
||
(In reply to Gabriele Svelto [:gsvelto] from comment #13)
(In reply to Andrew Overholt [:overholt] from comment #12)
No.
What about https://crash-stats.mozilla.org/report/index/cf04356a-597e-4733-b137-58e330190402 that about:crashes says is from April 2?
It looks like a valid crash so we're getting some... but not all of them apparently.
That's a jit crash, maybe implying our jit exception handling hooks are working?
Assignee | ||
Comment 15•6 years ago
|
||
I further narrowed it down to between rev fd1adb9941e9 (good) and rev 8941a9be9141 (bad). That's with shippable builds, with local builds the exception handler never seem to work which is really odd. I'm now changing my compiler to what was used for the last shippable build where it worked in the hope that minimizing the differences between my local build and the shippable one will lead me to the solution.
Assignee | ||
Updated•6 years ago
|
Assignee | ||
Comment 16•6 years ago
|
||
I may have struck go with my last test. Downgrading rust to 1.32.0 got the crashreporter working again on rev fd1adb9941e9. I'm now trying rev 8941a9be9141 too.
Assignee | ||
Comment 17•6 years ago
|
||
Yup, if build rev 8941a9be9141 with rust 1.32.0 the exception handler works fine, if I use the current nightly it doesn't. I'll have to bisect rustc tomorrow :(
Assignee | ||
Comment 18•6 years ago
|
||
I'm updating the title to reflect the issue here.
I tested building rev 8941a9be9141 with clang version 8.0.0 (tags/RELEASE_800/rc3 355016) and rust 1.32.0 and the exception handler works fine. Upgrading only clang to version 8.0.0 (tags/RELEASE_800/final 356365) or only rust to version 1.33.0 also works but if I build with both then it stops working. It seems to be some kind of interaction between the two. The JIT exception handler seems unaffected as per comment 14.
David, Mike any idea as to what might be causing this? I hoped I would be able to narrow it down by bisecting clang but with also rust in the mix it will take ages.
Comment 19•6 years ago
|
||
Can you clarify what are your exact steps to test this? If we're talking specifically about the crash in comment 0, that may be a different story due to the nature of that failure. Or are crashes broken more generally?
Also, now that m-c has the Rust 1.34 release, does that change anything?
Assignee | ||
Comment 20•6 years ago
|
||
(In reply to David Major [:dmajor] (low availability) from comment #19)
Can you clarify what are your exact steps to test this? If we're talking specifically about the crash in comment 0, that may be a different story due to the nature of that failure. Or are crashes broken more generally?
JIT crashes work, regular ones (e.g. MOZ_CRASH()) don't. I've tried both content and main process crashes with about:crashcontent and about:crashparent
Also, now that m-c has the Rust 1.34 release, does that change anything?
No, I tried all the way to the current nightly. In short:
clang 8.0.0rc3 + rust 1.32 => works
clang 8.0.0 + rust 1.32 => works
clang 8.0.0rc3 + rust 1.33+ => works
clang 8.0.0 + rust 1.33+ => doesn't work
:froydnj pointed to these two recent LLVM commits affecting AArch64, the second one looks suspicious in this context:
https://github.com/llvm/llvm-project/commit/502c6557aa1aa6917fe5ce0ee26cfe936d45eb98
https://github.com/llvm/llvm-project/commit/d1786389da278381d02324739dd1bd5825c6f860
Comment 21•6 years ago
|
||
I tested about:crashparent on nightly 04-09, with e10s disabled to make debugging easier.
It looks like we can't unwind the frame for xul!Interpret
. WinDbg's k
command stops there, and so does the unwind in ntdll!RtlDispatchException
. On an x86_64 build, I can get much farther, well into the JIT frames, which is sufficient to reach Breakpad via our handler.
Would you mind testing a try build with that second LLVM commit reverted? I've got an awkward work setup today and can't get a new build onto my arm machine.
Assignee | ||
Comment 22•6 years ago
|
||
I haven't a local LLVM build handy but I could get one working on Monday. BTW how do you set breakpoints to hit ntdll!RtlDispatchException? I've tried by disabling first chance exception handling and breaking on ntdll!RtlDispatchException but somehow I never seem to get there in windbg.
Comment 23•6 years ago
|
||
(In reply to Gabriele Svelto [:gsvelto] from comment #22)
BTW how do you set breakpoints to hit ntdll!RtlDispatchException? I've tried
by disabling first chance exception handling and breaking on
ntdll!RtlDispatchException but somehow I never seem to get there in windbg.
Since breakpoints are implemented with exceptions, it's kind of a chicken and egg problem, so I do this horrible thing where I substitute an infinite loop. There might be a smarter way to do it but this gets the job done:
> r @$t0=poi(ntdll!RtlDispatchException) ; save the first instructions in a temp register
> ed ntdll!RtlDispatchException 14000000 ; infinite loop
> g ; Repro the crash, then break in
> !runaway ; Find which thread is spinning
> ~4s ; Switch to that thread
> eq ntdll!RtlDispatchException @$t0 ; restore the first instruction
Assignee | ||
Comment 24•6 years ago
|
||
(In reply to David Major [:dmajor] (low availability) from comment #23)
Since breakpoints are implemented with exceptions, it's kind of a chicken and egg problem, so I do this horrible thing where I substitute an infinite loop. There might be a smarter way to do it but this gets the job done:
> r @$t0=poi(ntdll!RtlDispatchException) ; save the first instructions in a temp register > ed ntdll!RtlDispatchException 14000000 ; infinite loop > g ; Repro the crash, then break in > !runaway ; Find which thread is spinning > ~4s ; Switch to that thread > eq ntdll!RtlDispatchException @$t0 ; restore the first instruction
Thanks, that's neat. I tried using sxd bpe
to try and let the exception go to breakpad before going to the debugger but it doesn't seem to work. windbg documentation distinguishes between break and handle status for the different exception types but I haven't figured out the right combination of options (provided one exists).
I'm now building clang locally to analyze the changes in comment 20 and if that doesn't work I'll just bisect it (\me sighs deeply).
Updated•6 years ago
|
Comment 25•6 years ago
|
||
Note you don't necessarily need to build llvm locally. You can tweak the llvm versions in build/build-clang/*.json and push to try.
Assignee | ||
Comment 26•6 years ago
|
||
(In reply to Mike Hommey [:glandium] from comment #25)
Note you don't necessarily need to build llvm locally. You can tweak the llvm versions in build/build-clang/*.json and push to try.
Thanks for the tip. I'm almost done with the bisection on my local machine but this would have been very helpful (especially because I spent quite a bit of time just setting up the build).
Assignee | ||
Comment 27•6 years ago
|
||
Contrary to my original assessment it turns out that it's change 502c6557aa1aa6917fe5ce0ee26cfe936d45eb98 that's breaking unwinding. David's comment should have pointed me to it since unwinding gets stuck in the interpreter [1] which does indeed use a jump-table. The problems is still present in upstream LLVM so to unblock us I'll add a local patch to clang reverting the change. I'll also file a bug with upstream to fix this.
Assignee | ||
Comment 28•6 years ago
|
||
Here's my attempt at fixing this: https://treeherder.mozilla.org/#/jobs?repo=try&revision=868c0798eeefec89cc6169acd96e9e0ab8970520
Assignee | ||
Comment 29•6 years ago
|
||
I can generate crash reports correctly with the build from comment 28, time for review.
Assignee | ||
Comment 30•6 years ago
|
||
Comment 31•6 years ago
|
||
Have you filed a bug upstream for this?
Assignee | ||
Comment 32•6 years ago
|
||
(In reply to David Major [:dmajor] (low availability) from comment #31)
Have you filed a bug upstream for this?
Not yet, I'm trying to prepare a reduced test-case before I do.
Comment 33•6 years ago
|
||
I think you might as well file it and CC the patch author and reviewers. There's a chance they may be able to "just see it" and save you the trouble. If not, nothing lost, and they may be able to make suggestions about the way to reduce it that would help them the most.
Assignee | ||
Comment 34•6 years ago
|
||
(In reply to David Major [:dmajor] (low availability) from comment #33)
I think you might as well file it and CC the patch author and reviewers. There's a chance they may be able to "just see it" and save you the trouble. If not, nothing lost, and they may be able to make suggestions about the way to reduce it that would help them the most.
Thanks, I'll do that. I've looked at the generated assembly and the only difference is the executable section where the jump table data is stored. Since I don't know how that could affect structured exception handling on Windows I have no clue why it's not working; hopefully they'll know more about it.
Comment 36•6 years ago
|
||
[Tracking Requested - why for this release]: It would be nice if our crashreporter worked in arm64 windows on 67, since that's the first release we're officially doing with arm64 windows support. The fix is small and arm64-specific.
Updated•6 years ago
|
Comment 37•6 years ago
|
||
Assignee | ||
Comment 38•6 years ago
|
||
I filed a bug upstream for this. I couldn't CC the patch author and reviewers because their email addresses weren't visible in the commit message. I looked for a bug related to that change but couldn't find it.
Comment 39•6 years ago
|
||
bugherder |
Comment 40•6 years ago
|
||
Gabriele, is that a patch we should uplift to beta or can it just ride the 68 train?
Assignee | ||
Comment 41•6 years ago
|
||
(In reply to Pascal Chevrel:pascalc from comment #40)
Gabriele, is that a patch we should uplift to beta or can it just ride the 68 train?
We want this in beta, because that's affected too and we're shipping it on AArch64/Win IIRC. Sorry, I forgot!
Comment 42•6 years ago
|
||
There is a better fix pending upstream at https://reviews.llvm.org/D61095. I think it's too late to rush it into beta, but as a followup we should replace our patch with upstream's when it lands.
Assignee | ||
Comment 43•6 years ago
|
||
Comment on attachment 9058745 [details]
Bug 1542827 - Revert a change to LLVM to fix unwinding to the exception handler on Windows/AArch64
Beta/Release Uplift Approval Request
- User impact if declined: Users will not be able to send crash reports
- Is this code covered by automated tests?: No
- Has the fix been verified in Nightly?: Yes
- Needs manual test from QE?: Yes
- If yes, steps to reproduce: 1. Launch Firefox
- Navigate to about:crashcontent
- The "tab crashed" page should be shown complete with the form to submit the crash
- Navigate to about:crashparent
- Firefox should crash and the crash reporter dialog should appear
- List of other uplifts needed: None
- Risk to taking this patch: Low
- Why is the change risky/not risky? (and alternatives if risky): The change removes the offending patch from the toolchain, no other code is affected. This functionality is covered by automated tests but they're still not working (for unrelated reasons).
- String changes made/needed: None
Assignee | ||
Updated•6 years ago
|
Comment 44•6 years ago
|
||
Comment on attachment 9058745 [details]
Bug 1542827 - Revert a change to LLVM to fix unwinding to the exception handler on Windows/AArch64
Uplift accepted for 67 beta 16, thanks.
Comment 45•6 years ago
|
||
bugherder uplift |
Comment 46•6 years ago
|
||
Reproduced the issue on affected Nightly 2019-04-14 on Lenovo Yoga laptop.
Verified - fixed on the latest Nightly 68.0a1 (2019-05-01) (64-bit) and latest Beta 67.0b16 (64-bit).
The crash reporter is displayed and the submitted crash report can be seen listed in about:crashes.
Description
•