Closed Bug 1687914 Opened 3 years ago Closed 2 years ago

Ryzen first generation Crash in [@ nsThread::GetLabeledRunnableName] and [@ mozilla::PerformanceCounterState::RunnableWillRun]

Categories

(Firefox Build System :: Toolchains, defect)

Tracking

(firefox-esr78 unaffected, firefox84 unaffected, firefox85 unaffected, firefox86+ wontfix, firefox87 affected)

RESOLVED WONTFIX
Tracking Status
firefox-esr78 --- unaffected
firefox84 --- unaffected
firefox85 --- unaffected
firefox86 + wontfix
firefox87 --- affected

People

(Reporter: aryx, Unassigned)

Details

(Keywords: crash)

This is a new, frequent crash starting with 86.0a1 20210120214738. The signature indicates this is about a background hang.

Mak, Daisuke, could this be from bug 1678619? crash-stats says 'content' process crashed, but might this just mean it's off the main thread? Push log is https://hg.mozilla.org/mozilla-central/pushloghtml?fromchange=a3cd8f83fefafedaccd0c52a73ab001c510e63fe&tochange=488d7ff0470408d2c82881247c037b3c30a1db60

Crash report: https://crash-stats.mozilla.org/report/index/a8bdd63d-0235-49d0-a005-44a190210121

Reason: EXCEPTION_ACCESS_VIOLATION_READ

Top 10 frames of crashing thread:

0 xul.dll static nsThread::GetLabeledRunnableName xpcom/threads/nsThread.cpp:1001
1 xul.dll mozilla::TaskController::DoExecuteNextTaskOnlyMainThreadInternal xpcom/threads/TaskController.cpp:741
2 xul.dll nsThread::ProcessNextEvent xpcom/threads/nsThread.cpp:1200
3 xul.dll mozilla::ipc::MessagePump::Run ipc/glue/MessagePump.cpp:87
4 xul.dll MessageLoop::RunHandler ipc/chromium/src/base/message_loop.cc:327
5 xul.dll MessageLoop::Run ipc/chromium/src/base/message_loop.cc:309
6 xul.dll nsBaseAppShell::Run widget/nsBaseAppShell.cpp:137
7 xul.dll nsAppShell::Run widget/windows/nsAppShell.cpp:602
8 xul.dll XRE_RunAppShell toolkit/xre/nsEmbedFunctions.cpp:902
9 xul.dll MessageLoop::RunHandler ipc/chromium/src/base/message_loop.cc:327
Flags: needinfo?(mak)

Bugbug thinks this bug should belong to this component, but please revert this change in case of error.

Component: Untriaged → XPCOM
Product: Firefox → Core

It's possible, but I'm not sure how to confirm or dismiss it. We modified some runnables, and this crash happens while executing a runnable, which makes that bug a possible culprit, but I don't have a clear picture of how the crash would happen.
Maybe we could do a backout and check whether the crash goes away, to at least confirm?

Flags: needinfo?(mak)

Bug 1678619 got backed out but shipped in two Nightlies. We only got these crash reports for the first one. Let's keep the patches out of the next Nightly (10pm UTC). Sheriffs, we'll reland it to get it into the Friday 10am Nightly. I will comment in this bug if we experience the issue with a newer Nightly.

Gabriele emphasized that all these crashes come from machines with first-generation Ryzen processors.

Crash Signature: [@ nsThread::GetLabeledRunnableName] → [@ mozilla::PerformanceCounterState::RunnableWillRun] [@ nsThread::GetLabeledRunnableName]
Component: XPCOM → Toolchains
Product: Core → Firefox Build System
Summary: Crash in [@ nsThread::GetLabeledRunnableName] → Ryzen first generation Crash in [@ nsThread::GetLabeledRunnableName] and [@ mozilla::PerformanceCounterState::RunnableWillRun]

Under the first signature there are a few crashes that aren't coming from Ryzen 1 boxes, and these appear to be machines with lousy memory: the crash addresses and reasons are all over the place. The Ryzen 1 crashes, however, all look the same: a failed read from address 0xffffffffffffffff.

It's hard for that to be a coincidence.

I looked through some minidumps, and the crashing instructions seem to be either harmless lea instructions or movdqu xmm1,xmmword ptr [xul!...kIID], neither of which should have a problem. If the CPU thinks the instruction is actually something else, then the read address might be anything, which would very likely be non-canonical and so get reported to us as 0xffffffffffffffff due to bug 1493342.
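
To make the non-canonical point concrete, here is a minimal Python sketch of the check, assuming x86-64 with 48-bit virtual addresses (4-level paging). The function name and constants are illustrative and not taken from the Firefox tree. Note that 0xffffffffffffffff is itself canonical; as the comment above describes, in these reports it stands in for a faulting address that was not reported, rather than being the address that was actually read.

```python
# Sketch only: canonical-address check for x86-64.
# Assumption: 48 significant virtual-address bits (4-level paging).
VIRTUAL_ADDRESS_BITS = 48
PLACEHOLDER_ADDRESS = 0xFFFFFFFFFFFFFFFF  # value seen in the crash reports


def is_canonical(address: int, bits: int = VIRTUAL_ADDRESS_BITS) -> bool:
    """Return True if bits 63..(bits - 1) of a 64-bit address are all equal."""
    upper = (address & 0xFFFFFFFFFFFFFFFF) >> (bits - 1)
    all_ones = (1 << (64 - bits + 1)) - 1
    return upper == 0 or upper == all_ones


if __name__ == "__main__":
    for addr in (0x00007FFFFFFFFFFF, 0x0000800000000000, PLACEHOLDER_ADDRESS):
        print(f"{addr:#018x} canonical={is_canonical(addr)}")
```

Any value whose upper 17 bits are mixed, which is what a misdecoded instruction would most likely produce, fails this check.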

I opened a couple of minidumps myself and found crashes on conditional jumps too (if the minidump is to be trusted). I'm going to have a look at AMD's errata for Ryzen 1 in search of some hints, because I found a whole bunch of signatures that share these characteristics (Ryzen 1-specific, crash is a read from a non-canonical address, no such address present in the registers).
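
The three characteristics listed above could be turned into a rough triage filter. The sketch below is hypothetical: the function, its parameters, and the CPU marker string are assumptions made for illustration (the marker assumes a CPU description of the form "family 23 model 1 stepping 1" for first-generation Ryzen) and do not describe a real crash-stats API.

```python
# Sketch only: flag crash reports that match the pattern described above.
# All names here are hypothetical; the CPU marker string is an assumption.
from typing import Mapping

PLACEHOLDER_ADDRESS = 0xFFFFFFFFFFFFFFFF
RYZEN1_MARKER = "family 23 model 1 "  # assumed format; trailing space avoids "model 17"


def _is_canonical(value: int) -> bool:
    """48-bit canonical check: bits 63..47 must be all zeros or all ones."""
    upper = (value & 0xFFFFFFFFFFFFFFFF) >> 47
    return upper in (0, 0x1FFFF)


def matches_ryzen1_pattern(
    cpu_info: str,
    reason: str,
    crash_address: int,
    registers: Mapping[str, int],
) -> bool:
    """True if a report shows all three characteristics from the comment."""
    return (
        RYZEN1_MARKER in cpu_info
        and reason == "EXCEPTION_ACCESS_VIOLATION_READ"
        and crash_address == PLACEHOLDER_ADDRESS
        # No register holds a non-canonical value that could explain the fault.
        and all(_is_canonical(v) for v in registers.values())
    )
```

A report where some register does hold a non-canonical value would be excluded, since an ordinary bad pointer could then explain the fault.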

FYI, there are two issues in the official errata document that sound like they may cause this: 1021 and 1091. Both cause stale data to be delivered to a load which was waiting on a preceding store. In both cases the errata document mentions that they're triggered by a specific set of timing conditions. Given that the crashes seem to be Nightly-specific, it might be that those particular builds had the right instruction sequences to trigger the flaw. That said, this is only speculation on my part; I don't think we have a way to verify this in practice given the complexity involved.

The original issue here is gone and the remaining crashes appear to be bad hardware.

Status: NEW → RESOLVED
Closed: 2 years ago
Resolution: --- → WONTFIX
See Also: → 1796126