crash reporting AWOL on Windows aarch64 builds

VERIFIED FIXED in Firefox 66

Status

()

defect
VERIFIED FIXED
5 months ago
4 months ago

People

(Reporter: steven, Assigned: gsvelto)

Tracking

(Blocks 2 bugs)

67 Branch
mozilla67
Points:
---
Dependency tree / graph

Firefox Tracking Flags

(firefox66+ verified, firefox67+ verified)

Details

Attachments

(1 attachment)

User Agent: Mozilla/5.0 (Windows NT 10.0; Win64; x64; rv:65.0) Gecko/20100101 Firefox/65.0

Steps to reproduce:

  • Run Firefox Nightly build 20190204214259
  • Navigate to about:crashparent

Actual results:

Browser crashes, but crash reporter doesn't trap it or report it. (Other crashes don't get trapped or reported either.)

Also the crashes do not show up in about:crashes.

Expected results:

The crash reporter should show up.

  • gsvelto to CC

I don't have my ARM64 laptop handy now, I'll have a look on Monday or Tuesday at the latest.

I managed to reproduce this issue on Lenovo Yoga C630-13Q50 with Windows 10 Home (v1803) on Firefox Nightly 67.0a1 (2019-02-11) aarch64 builds.

Status: UNCONFIRMED → NEW
Component: Untriaged → Crash Reporting
Ever confirmed: true
Product: Firefox → Toolkit
Duplicate of this bug: 1524419

Looking into this right now.

No minidumps are being generated at all, I'm running a bisection for lack of better ideas as to what the cause might be.

OK, this is feeling increasingly nightmarish. For a while in the middle of the bisection range Firefox just didn't start at all on Win/AARch64 which is making things really complicated.

Duplicate of this bug: 1528026

crash-stats stopped receiving reports around the end of January or beginning of February:

Bug 1518947 is looking suspicious in that light...

(In reply to David Major [:dmajor] from comment #10)

crash-stats stopped receiving reports around the end of January or beginning of February:

Bug 1518947 is looking suspicious in that light...

That was also my first guess but I tested it and that's not it.

Assignee: nobody → gsvelto
Status: NEW → ASSIGNED

There's something really odd going on. I built versions back to before nightly 20190131093752 for which we have crash-reports and none work. I'm wondering, when did we switch building to clang? I did all my tests with clang and I suddenly remembered about bug 1424304 where I hit a clang bug that caused the exception handler not to be invoked.

(In reply to Gabriele Svelto [:gsvelto] from comment #12)

There's something really odd going on. I built versions back to before
nightly 20190131093752 for which we have crash-reports and none work. I'm
wondering, when did we switch building to clang? I did all my tests with
clang and I suddenly remembered about bug 1424304 where I hit a clang bug
that caused the exception handler not to be invoked.

The 24th: https://bugzilla.mozilla.org/show_bug.cgi?id=1512822#c6

(In reply to David Major [:dmajor] from comment #13)

The 24th: https://bugzilla.mozilla.org/show_bug.cgi?id=1512822#c6

So that's not the likely cause either otherwise we wouldn't have received crash reports for the last week of January.

There are at least two issues happening, and they landed at different times, which explains the difficulty of bisection.

One is bug 1528304.

Another is that we don't seem to be doing the right thing when there are JIT frames on the stack. I've tried importing the unwind patches from bug 1527471, as well as disabling the hacky unwind info in bug 1484835, but it's still not working. I need to investigate more.

You are not authorized to access bug 1528304.

Security-related?

No longer blocks: 1525743

I did some more investigation, and long story short, our fake JIT function tables don't work on arm64. Or at least not fully. I'm pretty sure our setup works when a crash originates in a JIT frame, because I remember testing that during bug 1484835. But if you have a crash in C++ code (say, about:crashcontent) and during the unwind there's a JIT frame further up the stack, our fake unwind info throws things off and we never reach Breakpad.

I have wt traces of ntdll!RtlDispatchException on both a good x86_64 build and a broken arm64 one. In both cases our function table callback gets called, but we diverge shortly after.

On x86_64, once RtlLookupFunctionEntry returns our entry, RtlDispatchException looks at it, finds its exception handler, and goes into RtlpExecuteHandlerForException which eventually leads to Breakpad and all is well:

   11     2 [  3]       xul!RuntimeFunctionCallback
   52    45 [  2]     ntdll!RtlpLookupDynamicFunctionEntry
   53   253 [  1]   ntdll!RtlLookupFunctionEntry
10135 19518 [  0] ntdll!RtlDispatchException
    4     0 [  1]   ntdll!RtlpExecuteHandlerForException

But arm64 tries to do a full unwind, which returns null for the exception handler (I'm guessing because our fake tables just aren't good enough):

   11     3 [  3]       xul!RuntimeFunctionCallback
   46    47 [  2]     ntdll!RtlpLookupDynamicFunctionEntry
   37   224 [  1]   ntdll!RtlLookupFunctionEntry
 5068 30567 [  0] ntdll!RtlDispatchException
   22     0 [  1]   ntdll!RtlpxVirtualUnwind
   82     0 [  2]     ntdll!RtlpUnwindFunctionFull
Blocks: 1525743
Blocks: 1529179

A tab just crashed on me, and I saw the crash reporter. This on Arm64 windows nightly 67.0a1 (2019-02-20) (64-bit).

Gabriele, did the fix in bug 1528304 help at all?

Flags: needinfo?(gsvelto)

(In reply to Liz Henry (:lizzard) (use needinfo) from comment #19)

Gabriele, did the fix in bug 1528304 help at all?

That bug fixed one side of the problem, and we've started to get some crash reports coming in, but comment 17 still stands in the way of many reports.

Flags: needinfo?(gsvelto)

This works around the issue where if the PC and SP don't change while unwinding our JIT frame, we'll fail the unwinder's sanity checks and it won't call our exception handler.

Ideally we'd store proper unwind info, but that's a larger change for another day.

Yes, as David already said we're getting some crash reports (here's one with a decent stack-trace https://crash-stats.mozilla.com/report/index/2b3d6cdf-a5cf-4f3e-a3a4-03fd80190221) but not all of them.

Pushed by dmajor@mozilla.com:
https://hg.mozilla.org/integration/autoland/rev/a00f8b23be0c
Add a fake unwind code to arm64 JIT function entries r=luke
Status: ASSIGNED → RESOLVED
Closed: 5 months ago
Resolution: --- → FIXED
Target Milestone: --- → mozilla67

:egao, want to try re-running xpcshell tests to see if this made any improvement?

Flags: needinfo?(egao)

(In reply to David Major [:dmajor] from comment #25)

:egao, want to try re-running xpcshell tests to see if this made any improvement?

For the post-patch revision, mozilla-central revision from 2/26 (PST) was used, revision d326a9d.

Try:

opt-xpcshell-5 failures appears to have been addressed.

Failures per chunk:

  • opt-xpcshell-2: same failures between pre and post-patch
  • opt-xpcshell-6: same as above
  • opt-xpcshell-7: same as above
  • opt-xpcshell-8: same as above

Failures in chunks 2, 6, 7 are few and far in between. Most tests appear to pass in the given chunks.

Failures in chunk 8 appears numerous. 71 failures, 337 passes - 21% failure rate, unchanged between patch.

Does this align with your expectations :dmajor?

(In reply to Edwin Gao (:egao) from comment #26)

Does this align with your expectations :dmajor?

Well, I was hoping to see that this fixed the test_crash_* failures in X8, but it looks like they were already gone before this patch. I guess I can't complain!

Please nominate this for uplift whenever you feel comfortable.

Flags: needinfo?(dmajor)

:dmajor/:froydnj, I am reviewing the bug dependency tree for windows10-aarch64.

In bug 1525378 it was mentioned that lack of crashreporter was the likely cause for the failures. In the try push from comment 26 it is possible to see the same failures still occur after patch for this bug has been landed. The failure details are also similar if not identical.

What is the next step for bug 1525378, since the prevailing thought was that it would be fixed with this patch?

Flags: needinfo?(egao)

(In reply to Edwin Gao (:egao) from comment #29)

What is the next step for bug 1525378, since the prevailing thought was that
it would be fixed with this patch?

In this case the next step is to remove the assumption that it would be fixed by this patch, and get someone to debug it in its own right.

Flags: needinfo?(dmajor)

I'm assuming that https://crash-stats.mozilla.com/report/index/20bff942-5948-4a1e-8c9a-da3390190227 is a report that we would not have received without this patch, so I'll count that as validation.

Comment on attachment 9046010 [details]
Add a fake unwind code to arm64 JIT function entries

Beta/Release Uplift Approval Request

  • Feature/Bug causing the regression: None
  • User impact if declined: Missing crash reports on arm64
  • Is this code covered by automated tests?: Unknown
  • Has the fix been verified in Nightly?: Yes
  • Needs manual test from QE?: Yes
  • If yes, steps to reproduce: about:crashparent and about:crashcontent
  • List of other uplifts needed: None
  • Risk to taking this patch: Low
  • Why is the change risky/not risky? (and alternatives if risky): This code is compiled only on arm64, and it runs only after we've already crashed
  • String changes made/needed: No
Attachment #9046010 - Flags: approval-mozilla-beta?

Comment on attachment 9046010 [details]
Add a fake unwind code to arm64 JIT function entries

Looks like this patch helped us get crash reporting working again.
Thanks! OK for beta 12 uplift.

Attachment #9046010 - Flags: approval-mozilla-beta? → approval-mozilla-beta+
Whiteboard: [qa-triaged]

Verified as fixed on Firefox Nightly 67.0a1 (2019-02-28) and on Firefox 66.0b12 aarch64 builds on Lenovo Yoga C630-13Q50 with Windows 10 Home.

Status: RESOLVED → VERIFIED
Flags: qe-verify+
QA Whiteboard: [qa-triaged]
Whiteboard: [qa-triaged]
You need to log in before you can comment on or make changes to this bug.