Closed Bug 1871151 Opened 1 year ago Closed 1 year ago

Crash in [@ gemmology::(anonymous namespace)::kernel::maddw]

Categories

(Firefox :: Translations, defect)

x86_64
All
defect

VERIFIED FIXED
123 Branch
Tracking Status
firefox-esr115 --- unaffected
firefox121 --- unaffected
firefox122 --- unaffected
firefox123 + fixed

People

(Reporter: mccr8, Assigned: sergesanspaille)

References

(Regression)

Details

(Keywords: crash, regression)

[Tracking Requested - why for this release]:

Crash report: https://crash-stats.mozilla.org/report/index/7c105665-3c0d-461a-a614-b4baa0231220

Reason: EXCEPTION_ILLEGAL_INSTRUCTION

Top 10 frames of crashing thread:

0  xul.dll  gemmology::  third_party/gemmology/gemmology.h:208
0  xul.dll  gemmology::  third_party/gemmology/gemmology.h:640
0  xul.dll  gemmology::  third_party/gemmology/gemmology.h:646
0  xul.dll  gemmology::Engine<xsimd::avxvnni>::Shift::PrepareBias<gemmology::callbacks::UnquantizeAndAddBiasAndWrite>  third_party/gemmology/gemmology.h:1303
1  xul.dll  js::intgemm::IntrI8PrepareBias::<lambda_4>::operator const  js/src/intgemm/IntegerGemmIntrinsic.cpp:317
1  xul.dll  xsimd::detail::dispatcher<`lambda at /builds/worker/checkouts/gecko/js/src/intgemm/IntegerGemmIntrinsic.cpp:317:3', xsimd::arch_list<xsimd::avxvnni, xsimd::avx2, xsimd::ssse3, xsimd::sse2> >::walk_archs  third_party/xsimd/include/xsimd/config/xsimd_arch.hpp:238
1  xul.dll  xsimd::detail::dispatcher<`lambda at /builds/worker/checkouts/gecko/js/src/intgemm/IntegerGemmIntrinsic.cpp:317:3', xsimd::arch_list<xsimd::avxvnni, xsimd::avx2, xsimd::ssse3, xsimd::sse2> >::operator  third_party/xsimd/include/xsimd/config/xsimd_arch.hpp:253
1  xul.dll  js::intgemm::IntrI8PrepareBias  js/src/intgemm/IntegerGemmIntrinsic.cpp:301
2  ?  @0x00000176f245301e  
3  xul.dll  WasmMemoryCopy  js/src/wasm/WasmInstance.cpp:566

Looks like a regression from bug 1868949.

Set release status flags based on info from the regressing bug 1868949

:sergesanspaille, since you are the author of the regressor, bug 1868949, could you take a look? Also, could you set the severity field?

For more information, please visit BugBot documentation.

Flags: needinfo?(sguelton)
Crash Signature: [@ gemmology::(anonymous namespace)::kernel::maddw] → [@ gemmology::(anonymous namespace)::kernel::maddw] [@ gemmology::(anonymous namespace)::kernel::maddw<T> ]

This probably means that the runtime detection code is incorrect :-/

Flags: needinfo?(sguelton)

Based on the crash report, the processor is a Tiger Lake, which supports AVX VNNI. I've double-checked the detection code and it looks correct. And the stack trace points at vpdpbusd, which is indeed an AVX VNNI instruction :-/

Serge, Tiger Lake supports AVX512 VNNI instructions (in 512-bit and 256-bit width), but that's not the same as AVXVNNI on later CPUs like Alder Lake, which encodes the instruction with a VEX prefix instead of EVEX. When you see vpdpbusd, check whether the instruction prefix is correct: specifically, it must not be VEX on Tiger Lake.

Yannis confirmed that the disassembly shows vex vpdpbusd, so it's AVXVNNI, but it means the detection code misfired. As far as I can tell you're checking the right bits though.
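
To make the distinction above concrete, here is a minimal sketch of the two separate CPUID feature bits involved. The function names are hypothetical and this is not the xsimd detection code; the helpers only decode register values the caller has already read via CPUID leaf 7, and a real check would also need to confirm OS support for the register state (XGETBV), which is left out here.

```cpp
#include <cstdint>

// AVX512_VNNI (EVEX-encoded vpdpbusd): CPUID.(EAX=7, ECX=0):ECX bit 11.
// Hypothetical helper; takes the ECX value returned for leaf 7, subleaf 0.
constexpr bool has_avx512_vnni(std::uint32_t leaf7_sub0_ecx) {
    return ((leaf7_sub0_ecx >> 11) & 1u) != 0;
}

// AVX-VNNI (VEX-encoded vpdpbusd): CPUID.(EAX=7, ECX=1):EAX bit 4.
// Hypothetical helper; takes the EAX value returned for leaf 7, subleaf 1.
constexpr bool has_avx_vnni(std::uint32_t leaf7_sub1_eax) {
    return ((leaf7_sub1_eax >> 4) & 1u) != 0;
}
```

A Tiger Lake class CPU would be expected to report the first bit but not the second, so code compiled for the VEX form must not be selected on it.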

I'm looking through the code, and there was a suspicion "best" CPU detection was at fault (https://github.com/xtensor-stack/xsimd/blob/a48ab430d4b84ecd5449180ee1c6d2eed67c4191/include/xsimd/config/xsimd_cpuid.hpp#L189), but I don't see anything wrong there. Note that even if AVXVNNI detection misfires, it should be overruled by the AVX512_VNNI detection that follows.

What I do notice is that gemmology (https://github.com/mozilla/gemmology/blob/40dda91e99088ff80e21d71e57415aa491a0954c/gemmology.h#L208) ONLY has code for the AVXVNNI version, not the AVX512_VNNI one. So indeed misfiring detection (rather than "best") could still be the cause.

I'm still looking, but for example this looks like a minor bug (it wouldn't cause this crash though, as we don't compile with Intel): https://github.com/xtensor-stack/xsimd/blob/a48ab430d4b84ecd5449180ee1c6d2eed67c4191/include/xsimd/config/xsimd_cpuid.hpp#L116

And this also looks suspicious, but it's probably dead code for Firefox: https://searchfox.org/mozilla-central/source/mozglue/misc/SSE.cpp#65 (comment that follows is also misleading)

Theory:
https://searchfox.org/mozilla-central/source/third_party/xsimd/include/xsimd/config/xsimd_arch.hpp#245

On AVX512 machines, this would set best_arch_found(available_architectures().best) to some level of AVX512 support, e.g. generic::version(3, 4, 1). Wouldn't the following code then match on AVXVNNI, which is generic::version(2, 3, 0)?
https://hg.mozilla.org/mozilla-central/file/37657c7691664026e54babf7d1cf608fe58a92fb/third_party/xsimd/include/xsimd/config/xsimd_arch.hpp#l237

This seems to match the specific case we have here, where hardware support for a higher-versioned architecture doesn't imply support for a lower-versioned one, and we provide a routine for the lower version but not for the higher one.
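
The theory above can be illustrated with a small model. This is a sketch under stated assumptions, not the xsimd implementation: `make_version`, `Arch`, `kCandidates` and both dispatch functions are hypothetical, and the version numbers for avx2, ssse3 and sse2 are made up for illustration (only the AVXVNNI and AVX512 values come from the comment above). The modeled machine supports AVX512 but not AVXVNNI.

```cpp
#include <array>
#include <string>

// Hypothetical packed version number; the real encoding may differ.
constexpr unsigned make_version(unsigned major, unsigned minor, unsigned patch) {
    return major * 10000u + minor * 100u + patch;
}

struct Arch {
    const char* name;
    unsigned version;  // ordering key used by the version-based dispatch
    bool supported;    // what a per-architecture CPUID check would report
};

// Candidates in preference order, mirroring
// arch_list<avxvnni, avx2, ssse3, sse2> from the stack trace.
// Modeled CPU: AVX512-capable, but without AVXVNNI.
constexpr std::array<Arch, 4> kCandidates = {{
    {"avxvnni", make_version(2, 3, 0), false},
    {"avx2",    make_version(2, 2, 0), true},   // illustrative version
    {"ssse3",   make_version(1, 3, 1), true},   // illustrative version
    {"sse2",    make_version(1, 2, 0), true},   // illustrative version
}};

// Suspected faulty scheme: pick the first candidate whose version
// does not exceed the best version detected on the machine.
std::string dispatch_by_version(unsigned best) {
    for (const Arch& a : kCandidates)
        if (a.version <= best) return a.name;
    return "none";
}

// Alternative scheme: pick the first candidate that is itself supported.
std::string dispatch_by_flag() {
    for (const Arch& a : kCandidates)
        if (a.supported) return a.name;
    return "none";
}
```

With a best version of make_version(3, 4, 1), the version comparison selects avxvnni even though the modeled CPU does not support it, while the per-architecture check falls through to avx2.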

OS: Windows 11 → All
Hardware: Unspecified → x86_64
Version: unspecified → Trunk

The bug is marked as tracked for firefox123 (nightly). However, the bug still isn't assigned.

:marco, could you please find an assignee for this tracked bug? Given that it is a regression and we know the cause, we could also simply back out the regressor. If you disagree with the tracking decision, please talk with the release managers.

Flags: needinfo?(mcastelluccio)

Serge is working on it.

Assignee: nobody → sguelton
Flags: needinfo?(mcastelluccio)
Severity: -- → S2
Pushed by sguelton@mozilla.com: https://hg.mozilla.org/integration/autoland/rev/cd8dc9b1338d Backport xsimd dispatch mechanism fix r=gcp
Status: NEW → RESOLVED
Closed: 1 year ago
Resolution: --- → FIXED
Target Milestone: --- → 123 Branch

:gcp, could you confirm the fix?

Flags: needinfo?(gpascutto)

Working on Zen 4.

Flags: needinfo?(gpascutto)
Status: RESOLVED → VERIFIED