crash reporting AWOL on Windows aarch64 builds
Categories
(Toolkit :: Crash Reporting, defect)
Tracking
()
People
(Reporter: steven, Assigned: gsvelto)
References
(Blocks 2 open bugs)
Details
Attachments
(1 file: 47 bytes, text/x-phabricator-request; lizzard: approval-mozilla-beta+)
User Agent: Mozilla/5.0 (Windows NT 10.0; Win64; x64; rv:65.0) Gecko/20100101 Firefox/65.0
Steps to reproduce:
- Run Firefox Nightly build 20190204214259
- Navigate to about:crashparent
Actual results:
Browser crashes, but crash reporter doesn't trap it or report it. (Other crashes don't get trapped or reported either.)
Also the crashes do not show up in about:crashes.
Expected results:
The crash reporter should show up.
Reporter
Comment 1•6 years ago
- gsvelto to CC
Assignee
Comment 2•6 years ago
I don't have my ARM64 laptop handy right now; I'll have a look on Monday, or Tuesday at the latest.
Comment 3•6 years ago
I managed to reproduce this issue on a Lenovo Yoga C630-13Q50 running Windows 10 Home (v1803) with the Firefox Nightly 67.0a1 (2019-02-11) aarch64 build.
Assignee
Comment 5•6 years ago
Looking into this right now.
Assignee
Comment 6•6 years ago
No minidumps are being generated at all. I'm running a bisection for lack of better ideas as to what the cause might be.
Assignee
Comment 7•6 years ago
OK, this is feeling increasingly nightmarish. For a while in the middle of the bisection range Firefox just didn't start at all on Win/AArch64, which is making things really complicated.
crash-stats stopped receiving reports around the end of January or beginning of February: https://crash-stats.mozilla.com/search/?release_channel=nightly&platform=Windows&cpu_arch=arm64&cpu_arch=0x000c&product=Firefox&date=%3E%3D2019-01-14T21%3A15%3A00.000Z&date=%3C2019-02-14T21%3A15%3A00.000Z&_facets=signature&_facets=build_id&_sort=-date&_columns=date&_columns=signature&_columns=product&_columns=version&_columns=build_id&_columns=platform#facet-build_id
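The percent-encoded query string in that link is hard to read at a glance. As a small sketch (the URL below is copied from the link above, with the repeated `_columns` parameters omitted for brevity), the Python standard library can decode the filters it applies:

```python
from urllib.parse import urlsplit, parse_qs

# URL copied from the comment above; the _columns parameters are left out
# here only to keep the example short.
url = (
    "https://crash-stats.mozilla.com/search/"
    "?release_channel=nightly&platform=Windows"
    "&cpu_arch=arm64&cpu_arch=0x000c&product=Firefox"
    "&date=%3E%3D2019-01-14T21%3A15%3A00.000Z"
    "&date=%3C2019-02-14T21%3A15%3A00.000Z"
    "&_facets=signature&_facets=build_id&_sort=-date"
    "#facet-build_id"
)

# parse_qs decodes percent-escapes and groups repeated keys into lists.
params = parse_qs(urlsplit(url).query)

for name in ("release_channel", "platform", "cpu_arch", "product", "date", "_facets"):
    print(f"{name}: {params[name]}")
```

Decoded, the two `date` values are a lower and an upper bound, so the search covers one month of Windows arm64 Nightly reports faceted by signature and build ID.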
Comment 10•6 years ago
> crash-stats stopped receiving reports around the end of January or beginning of February:
Bug 1518947 is looking suspicious in that light...
Assignee
Comment 11•6 years ago
(In reply to David Major [:dmajor] from comment #10)
> crash-stats stopped receiving reports around the end of January or beginning of February:
> Bug 1518947 is looking suspicious in that light...
That was also my first guess but I tested it and that's not it.
Assignee
Comment 12•6 years ago
There's something really odd going on. I built versions going back to before nightly 20190131093752, for which we have crash reports, and none of them work. I'm wondering, when did we switch to building with clang? I did all my tests with clang, and I suddenly remembered bug 1424304, where I hit a clang bug that caused the exception handler not to be invoked.
Comment 13•6 years ago
(In reply to Gabriele Svelto [:gsvelto] from comment #12)
> There's something really odd going on. I built versions back to before
> nightly 20190131093752 for which we have crash-reports and none work. I'm
> wondering, when did we switch building to clang? I did all my tests with
> clang and I suddenly remembered about bug 1424304 where I hit a clang bug
> that caused the exception handler not to be invoked.
The 24th: https://bugzilla.mozilla.org/show_bug.cgi?id=1512822#c6
Assignee
Comment 14•6 years ago
(In reply to David Major [:dmajor] from comment #13)
> The 24th: https://bugzilla.mozilla.org/show_bug.cgi?id=1512822#c6
So that's not the likely cause either; otherwise we wouldn't have received crash reports for the last week of January.
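The timeline argument here can be checked mechanically. The sketch below assumes the usual Nightly build ID layout of YYYYMMDDHHMMSS and assumes that "the 24th" means 24 January 2019 (the year and month are inferred from the surrounding comments, not stated in them):

```python
from datetime import date, datetime


def parse_build_id(build_id: str) -> datetime:
    """Interpret a build ID as a YYYYMMDDHHMMSS timestamp (assumed layout)."""
    return datetime.strptime(build_id, "%Y%m%d%H%M%S")


# Nightly that still produced crash reports, per comment 12.
last_reporting_build = parse_build_id("20190131093752")

# ASSUMPTION: "the 24th" refers to 24 January 2019.
clang_switch = date(2019, 1, 24)

days_of_reports_after_switch = (last_reporting_build.date() - clang_switch).days
print(f"Reports kept arriving for {days_of_reports_after_switch} days after the switch")
```

Under those assumptions the switch to clang predates the last build known to send reports, which is why it cannot be the sole cause.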
Comment 15•6 years ago
There are at least two issues happening, and they landed at different times, which explains the difficulty of bisection.
One is bug 1528304.
Another is that we don't seem to be doing the right thing when there are JIT frames on the stack. I've tried importing the unwind patches from bug 1527471, as well as disabling the hacky unwind info in bug 1484835, but it's still not working. I need to investigate more.
Reporter
Comment 16•6 years ago
You are not authorized to access bug 1528304.
Security-related?
Comment 17•6 years ago
I did some more investigation, and long story short, our fake JIT function tables don't work on arm64. Or at least not fully. I'm pretty sure our setup works when a crash originates in a JIT frame, because I remember testing that during bug 1484835. But if you have a crash in C++ code (say, about:crashcontent) and during the unwind there's a JIT frame further up the stack, our fake unwind info throws things off and we never reach Breakpad.
I have wt traces of ntdll!RtlDispatchException on both a good x86_64 build and a broken arm64 one. In both cases our function table callback gets called, but we diverge shortly after.
On x86_64, once RtlLookupFunctionEntry returns our entry, RtlDispatchException looks at it, finds its exception handler, and goes into RtlpExecuteHandlerForException which eventually leads to Breakpad and all is well:
11 2 [ 3] xul!RuntimeFunctionCallback
52 45 [ 2] ntdll!RtlpLookupDynamicFunctionEntry
53 253 [ 1] ntdll!RtlLookupFunctionEntry
10135 19518 [ 0] ntdll!RtlDispatchException
4 0 [ 1] ntdll!RtlpExecuteHandlerForException
But arm64 tries to do a full unwind, which returns null for the exception handler (I'm guessing because our fake tables just aren't good enough):
11 3 [ 3] xul!RuntimeFunctionCallback
46 47 [ 2] ntdll!RtlpLookupDynamicFunctionEntry
37 224 [ 1] ntdll!RtlLookupFunctionEntry
5068 30567 [ 0] ntdll!RtlDispatchException
22 0 [ 1] ntdll!RtlpxVirtualUnwind
82 0 [ 2] ntdll!RtlpUnwindFunctionFull
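The divergence between the two traces can be illustrated with a toy model. This is not the real ntdll logic: `FunctionEntry`, the `unwindable` flag, and both `dispatch_*` functions are hypothetical names invented for illustration, and the model only encodes what the comment above describes (one path takes the handler straight from the looked-up entry, the other runs a full unwind first and ends up with no handler).

```python
from dataclasses import dataclass
from typing import Callable, List, Optional


@dataclass
class FunctionEntry:
    begin: int
    end: int
    handler: Optional[Callable[[], str]]
    # Hypothetical flag: would this entry's unwind data survive a full unwind?
    unwindable: bool


def lookup_function_entry(table: List[FunctionEntry], pc: int) -> Optional[FunctionEntry]:
    """Find the entry whose address range covers pc (stand-in for the callback)."""
    for entry in table:
        if entry.begin <= pc < entry.end:
            return entry
    return None


def dispatch_x86_64_like(table: List[FunctionEntry], pc: int) -> Optional[str]:
    """Toy x86_64 path: use the handler recorded in the looked-up entry."""
    entry = lookup_function_entry(table, pc)
    if entry is None or entry.handler is None:
        return None
    return entry.handler()


def full_unwind(entry: FunctionEntry) -> Optional[Callable[[], str]]:
    """Toy full unwind: yields the handler only if the unwind data is usable."""
    return entry.handler if entry.unwindable else None


def dispatch_arm64_like(table: List[FunctionEntry], pc: int) -> Optional[str]:
    """Toy arm64 path: the handler comes out of a full unwind, or not at all."""
    entry = lookup_function_entry(table, pc)
    if entry is None:
        return None
    handler = full_unwind(entry)
    return handler() if handler is not None else None


def breakpad_handler() -> str:
    return "breakpad"


# A fake JIT region whose unwind data is not good enough for a full unwind.
jit_table = [FunctionEntry(0x1000, 0x2000, breakpad_handler, unwindable=False)]
x86_result = dispatch_x86_64_like(jit_table, 0x1800)
arm64_result = dispatch_arm64_like(jit_table, 0x1800)
print(x86_result, arm64_result)
```

In this model the lookup succeeds on both paths, matching the traces, and only the extra unwind step keeps the handler from running.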
Comment 18•6 years ago
A tab just crashed on me, and I saw the crash reporter. This was on the Arm64 Windows Nightly 67.0a1 (2019-02-20) (64-bit).
Comment 19•6 years ago
Gabriele, did the fix in bug 1528304 help at all?
Comment 20•6 years ago
(In reply to Liz Henry (:lizzard) (use needinfo) from comment #19)
> Gabriele, did the fix in bug 1528304 help at all?
That bug fixed one side of the problem, and we've started to get some crash reports coming in, but comment 17 still stands in the way of many reports.
Comment 21•6 years ago
This works around the issue where if the PC and SP don't change while unwinding our JIT frame, we'll fail the unwinder's sanity checks and it won't call our exception handler.
Ideally we'd store proper unwind info, but that's a larger change for another day.
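A minimal sketch of that sanity check, as a toy model rather than the actual unwinder: `Context`, `virtual_unwind`, and `handler_is_called` are hypothetical names, and treating the fake unwind code as a 16-byte stack adjustment is an assumption made only so the example has a concrete number.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Context:
    pc: int
    sp: int


def virtual_unwind(ctx: Context, stack_adjust: int) -> Context:
    """Toy unwind step: only the stack pointer moves, by stack_adjust bytes."""
    return Context(ctx.pc, ctx.sp + stack_adjust)


def handler_is_called(ctx: Context, stack_adjust: int) -> bool:
    """Toy sanity check: unwinding a frame must change PC or SP."""
    unwound = virtual_unwind(ctx, stack_adjust)
    return unwound != ctx


jit_frame = Context(pc=0x1800, sp=0x7FF000)

# No unwind codes: the context does not change, so the check fails.
before_patch = handler_is_called(jit_frame, stack_adjust=0)

# ASSUMPTION: the fake unwind code behaves like a 16-byte stack adjustment.
after_patch = handler_is_called(jit_frame, stack_adjust=16)

print(before_patch, after_patch)
```

The point of the model is only that any unwind code producing visible progress is enough to pass the check, which is why a fake one works as a stopgap until proper unwind info is stored.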
Assignee
Comment 22•6 years ago
Yes, as David already said, we're getting some crash reports (here's one with a decent stack trace: https://crash-stats.mozilla.com/report/index/2b3d6cdf-a5cf-4f3e-a3a4-03fd80190221) but not all of them.
Comment 23•6 years ago
Comment 24•6 years ago
bugherder
Comment 25•6 years ago
:egao, want to try re-running xpcshell tests to see if this made any improvement?
Comment 26•6 years ago
(In reply to David Major [:dmajor] from comment #25)
> :egao, want to try re-running xpcshell tests to see if this made any improvement?
For the post-patch run, I used mozilla-central revision d326a9d from 2/26 (PST).
Try:
- prior to this patch: https://treeherder.mozilla.org/#/jobs?repo=try&revision=c0abb56001522773bc87748f6042efc97dc45797
- post-patch: https://treeherder.mozilla.org/#/jobs?repo=try&revision=8ed035a9b56d43c7aac643528075aef364bd6c0b
The opt-xpcshell-5 failures appear to have been addressed.
Failures per chunk:
- opt-xpcshell-2: same failures between pre and post-patch
- opt-xpcshell-6: same as above
- opt-xpcshell-7: same as above
- opt-xpcshell-8: same as above
Failures in chunks 2, 6, and 7 are few and far between. Most tests appear to pass in those chunks.
Failures in chunk 8 are numerous: 71 failures against 337 passes (about 17% of the 408 results, or a 21% failures-to-passes ratio), unchanged by the patch.
Does this align with your expectations, :dmajor?
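The chunk 8 counts can be turned into a percentage in two ways, depending on the denominator. A quick check, using only the 71 and 337 figures from the comment above:

```python
failures, passes = 71, 337
total = failures + passes

# Share of all results that failed.
failure_rate = failures / total

# Failures relative to passes only (a ratio, not a rate of the whole).
failures_to_passes = failures / passes

print(f"total results: {total}")
print(f"failures / total:  {failure_rate:.1%}")
print(f"failures / passes: {failures_to_passes:.1%}")
```

The 21% figure corresponds to failures divided by passes; measured against all 408 results the share is closer to 17%.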
Comment 27•6 years ago
(In reply to Edwin Gao (:egao) from comment #26)
> Does this align with your expectations :dmajor?
Well, I was hoping to see that this fixed the test_crash_* failures in X8, but it looks like they were already gone before this patch. I guess I can't complain!
Comment 28•6 years ago
Please nominate this for uplift whenever you feel comfortable.
Comment 29•6 years ago
:dmajor/:froydnj, I am reviewing the bug dependency tree for windows10-aarch64.
In bug 1525378 it was mentioned that the lack of a crash reporter was the likely cause of the failures. In the try push from comment 26, the same failures still occur after the patch for this bug landed. The failure details are also similar, if not identical.
What is the next step for bug 1525378, since the prevailing thought was that it would be fixed with this patch?
Comment 30•6 years ago
(In reply to Edwin Gao (:egao) from comment #29)
> What is the next step for bug 1525378, since the prevailing thought was that
> it would be fixed with this patch?
In this case the next step is to remove the assumption that it would be fixed by this patch, and get someone to debug it in its own right.
Comment 31•6 years ago
I'm assuming that https://crash-stats.mozilla.com/report/index/20bff942-5948-4a1e-8c9a-da3390190227 is a report that we would not have received without this patch, so I'll count that as validation.
Comment 32•6 years ago
Comment on attachment 9046010 [details]
Add a fake unwind code to arm64 JIT function entries
Beta/Release Uplift Approval Request
- Feature/Bug causing the regression: None
- User impact if declined: Missing crash reports on arm64
- Is this code covered by automated tests?: Unknown
- Has the fix been verified in Nightly?: Yes
- Needs manual test from QE?: Yes
- If yes, steps to reproduce: about:crashparent and about:crashcontent
- List of other uplifts needed: None
- Risk to taking this patch: Low
- Why is the change risky/not risky? (and alternatives if risky): This code is compiled only on arm64, and it runs only after we've already crashed
- String changes made/needed: No
Comment 33•6 years ago
Comment on attachment 9046010 [details]
Add a fake unwind code to arm64 JIT function entries
Looks like this patch helped us get crash reporting working again.
Thanks! OK for beta 12 uplift.
Comment 34•6 years ago
bugherder uplift
Comment 35•6 years ago
Verified as fixed on Firefox Nightly 67.0a1 (2019-02-28) and on Firefox 66.0b12 aarch64 builds, on a Lenovo Yoga C630-13Q50 with Windows 10 Home.