Closed Bug 1683232 Opened 3 years ago Closed 3 years ago

Crash in [@ EMPTY: no frame data available; OK]

Categories

(Toolkit :: Crash Reporting, defect)

Unspecified
Windows 10
defect

Tracking

RESOLVED FIXED
86 Branch
Tracking Status
firefox-esr78 --- wontfix
firefox84 --- wontfix
firefox85 --- wontfix
firefox86 --- fixed

People

(Reporter: mccr8, Assigned: gsvelto)

References

Details

(Keywords: crash)

Crash Data

Attachments

(1 file)

Crash report: https://crash-stats.mozilla.org/report/index/37bb9a10-6b75-4804-b1f1-5c4910201217

Crashes with this signature have no stack, which seems bad. There are some recent crashes on Nightly in the GPU process, but I don't know if that's significant.

Looks like a bug in the stack walker: if I open the minidump in Visual Studio I get a sensible stack trace.

All crashes are coming from Windows 10 version 10.0.19041 or 10.0.19042. The CPU context size is larger (1663 bytes) than what breakpad expects (1232 bytes). Sounds like Microsoft might have updated the minidump format or they have a bug in windbg.dll when writing minidumps.

I just found a minidump with a different size, 3263 bytes this time. This feels very odd.

I've got a patch that extracts the AMD64 context from the first 1232 bytes and skips over the rest, but I still need to do some patching up. Interestingly, the rust-minidump crate suffers from the same issue, so it's probably some last-minute undocumented change on Microsoft's side.
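To illustrate the approach described above, here is a minimal sketch in Python, not the actual breakpad patch. The function name `split_amd64_context` is hypothetical. The only number taken from this bug is the 1232-byte size breakpad expects.

```python
from typing import Tuple

# Size of the fixed AMD64 context that breakpad expects (from the comment above).
AMD64_CONTEXT_SIZE = 1232


def split_amd64_context(raw: bytes) -> Tuple[bytes, bytes]:
    """Split a raw context blob into the fixed-size part and any trailing bytes.

    Hypothetical helper: the trailing bytes are returned untouched so the
    caller can skip them or inspect them later.
    """
    if len(raw) < AMD64_CONTEXT_SIZE:
        raise ValueError(
            "context too small: %d < %d" % (len(raw), AMD64_CONTEXT_SIZE)
        )
    return raw[:AMD64_CONTEXT_SIZE], raw[AMD64_CONTEXT_SIZE:]


# A 1663-byte blob like the one seen in the first crash report.
context, extra = split_amd64_context(bytes(1663))
print(len(context), len(extra))
```

Keeping the trailing bytes rather than discarding them makes it easier to figure out later what the extra area contains.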

I just discovered another bit of information: all these minidumps come from machines with either Intel's Tiger Lake processors or AMD Ryzen 3 processors. The additional state in the AMD64 context must be specific to them, maybe some newfangled extension I haven't heard about yet?

Further investigation turned up these patterns:

  • x86 contexts for Intel processors are 2023 bytes larger than they should be
  • AMD64 contexts for Intel processors are 2031 (2023 + 8) bytes larger than they should be
  • x86 contexts for AMD processors are 423 bytes larger than they should be
  • AMD64 contexts for AMD processors are 431 (423 + 8) bytes larger than they should be
  • There are no apparent differences between minidumps generated by Windows 10.0.19041 and 10.0.19042
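The arithmetic behind these patterns can be checked with a small sketch. The 1232-byte AMD64 base size comes from this bug. The 716-byte x86 base size is an assumption not stated in the thread (it is the usual size of the Windows x86 CONTEXT structure, as far as I know). The names `BASE_SIZES`, `EXCESS_TO_VENDOR` and `classify_context` are hypothetical.

```python
# Expected fixed context sizes. 1232 is from this bug; 716 for x86 is an
# assumption (not stated in the thread).
BASE_SIZES = {"x86": 716, "amd64": 1232}

# Excess bytes observed per architecture, from the list above.
EXCESS_TO_VENDOR = {
    "x86": {2023: "Intel", 423: "AMD"},
    "amd64": {2031: "Intel", 431: "AMD"},
}


def classify_context(arch: str, size: int) -> str:
    """Guess which pattern a context of the given size matches."""
    excess = size - BASE_SIZES[arch]
    if excess == 0:
        return "standard"
    return EXCESS_TO_VENDOR[arch].get(excess, "unknown")


# The two AMD64 sizes mentioned earlier in this bug.
print(classify_context("amd64", 1663))  # 1232 + 431
print(classify_context("amd64", 3263))  # 1232 + 2031
```

Both sizes reported earlier (1663 and 3263 bytes) fit the AMD64 rows: the first matches the AMD excess and the second the Intel one.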

So it seems that Microsoft added a piece of state to the minidump format that is:

  • Only present in the latest generation of x86 processors (Ryzen 3 and Tiger Lake)
  • Only written out in Windows 10 starting with version 10.0.19041
  • Different between Intel and AMD
  • Probably contains two pointers given the size difference between the 32- and 64-bit contexts
  • Not documented in Microsoft minidump-related headers

I'll write a patch to simply skip over this new area but it would be interesting to figure out what it contains.

By looking into the new Intel and AMD ISA manuals I noticed that XSAVE instructions can now store into a variable-sized area that includes both already existing and new state: AVX, AVX-512, CET, MPX, PT, PKRU, HDC & HWP. I suspect I'm looking at that stuff. Most of the new bits are Intel-specific but at least a few of those are also present on Ryzen 3 (CET for one), so that might explain the size difference. The fact that it's variable-sized makes writing a patch to accommodate this a bit harder without knowing where to look for flags describing this area.
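For reference, XSAVE describes which pieces of state are present with a bitmap of state components. The sketch below decodes such a bitmap into the names listed above. The bit positions are my reading of the Intel SDM and should be double-checked against it. Whether the minidump records such a bitmap, and where, is exactly the open question here, so `decode_xsave_components` is a hypothetical helper, not something that parses a real minidump.

```python
# XSAVE state-component bits, as I understand them from the Intel SDM
# (assumption: verify against the manual before relying on this table).
XSAVE_COMPONENTS = {
    0: "x87",
    1: "SSE",
    2: "AVX",
    3: "MPX BNDREGS",
    4: "MPX BNDCSR",
    5: "AVX-512 opmask",
    6: "AVX-512 ZMM_Hi256",
    7: "AVX-512 Hi16_ZMM",
    8: "PT",
    9: "PKRU",
    11: "CET_U",
    12: "CET_S",
    13: "HDC",
    16: "HWP",
}


def decode_xsave_components(mask: int) -> list:
    """Return the names of the state components whose bits are set in mask."""
    return [
        name
        for bit, name in sorted(XSAVE_COMPONENTS.items())
        if mask & (1 << bit)
    ]


# Example: a mask with x87, SSE and AVX enabled.
print(decode_xsave_components(0b111))
```

Because each processor enables a different subset of components, the saved area has a different size on each, which would be consistent with the different excess sizes seen on Intel and AMD.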

Assignee: nobody → gsvelto
Status: NEW → ASSIGNED
Pushed by gsvelto@mozilla.com:
https://hg.mozilla.org/integration/autoland/rev/9c7eb4cade8b
Handle non-fixed size AMD64 and x86 contexts when processing minidumps r=KrisWright
Status: ASSIGNED → RESOLVED
Closed: 3 years ago
Resolution: --- → FIXED
Target Milestone: --- → 86 Branch

Note: this only fixed the issue in automation where we use the in-tree stack walker. We still have to fix bug 1683873 for this crash signature to go away.

The patch landed in nightly and beta is affected.
:gsvelto, is this bug important enough to require an uplift?
If not please set status_beta to wontfix.

For more information, please visit auto_nag documentation.

Flags: needinfo?(gsvelto)

This mostly affects Socorro so no need to uplift it. Our automation runners would benefit from the uplift but none of them is on bleeding edge hardware either, so not a problem.

Flags: needinfo?(gsvelto)
Depends on: 1687201