Closed Bug 1501343 Opened 11 months ago Closed 7 months ago
Crash in vcruntime140
.dll@0xcacf | sftk _New Attribute
This bug was filed from the Socorro interface and is report bp-b181fd25-6aae-4fce-b8d1-feaf10181023. ============================================================= Top 10 frames of crashing thread: 0 vcruntime140.dll vcruntime140.dll@0xcacf 1 softokn3.dll struct SFTKAttributeStr* sftk_NewAttribute security/nss/lib/softoken/pkcs11u.c:66 2 softokn3.dll NSC_GenerateKeyPair security/nss/lib/softoken/pkcs11c.c:4788 3 nss3.dll PK11_GenerateKeyPairWithOpFlags security/nss/lib/pk11wrap/pk11akey.c:1530 4 nss3.dll SECKEY_CreateECPrivateKey security/nss/lib/cryptohi/seckey.c:219 5 nss3.dll ssl_CreateECDHEphemeralKeyPair security/nss/lib/ssl/ssl3ecc.c:451 6 nss3.dll ssl3_SendECDHClientKeyExchange security/nss/lib/ssl/ssl3ecc.c:192 7 nss3.dll static _SECStatus ssl3_SendClientSecondRound security/nss/lib/ssl/ssl3con.c:7606 8 nss3.dll ssl3_HandleHandshakeMessage security/nss/lib/ssl/ssl3con.c:11632 9 nss3.dll ssl3_HandleNonApplicationData security/nss/lib/ssl/ssl3con.c:12340 ============================================================= crash reports with this signature seem to be regressing in volume in firefox 63. they're all coming from 64bit installations and amd "cpu family 20 model 1 stepping 0 | 2" - this particular cpu has a history of being problematic (bug 772330 & others...)
If this is CPU related, is there a known workaround?
Historically the only things we've been able to do are (1) Rebuild and hope that PGO nondeterminism makes the binary different enough to avoid triggering the crash, or (2) Look for functions that repeatedly fall victim to this issue and refactor them. In this case neither is likely to work because the crash is outside of our code. Probably the best you could do is ask Microsoft to mine their own WER data and confirm whether they see a lot of hits in vcruntime140.dll@0xcacf in version 14.13.26020.0 amd64, and ask them to rework that function (memset?), assuming that's even the latest version of that DLL. (PS - why is this bug hidden?)
11 months ago
Assignee: nobody → nobody
Component: Security: PSM → Libraries
Product: Core → NSS
QA Contact: franziskuskiefer
Version: unspecified → other
> (PS - why is this bug hidden?) All wildptr (apparently real pointer, but faulted) READ failures
That's our biggest new crasher on 63, can we get this one prioritized and assigned? I'd take a patch for this crash in a dot release if we have one.
Since this is an apparent issue in an underlying OS library on a specific CPU, it may be tough. However, it's possible that what's happening here is that the CPU is messing up earlier (in our code), and the data only gets touched after it's passed to an OS routine. This wouldn't be amazingly surprising - in fact, I would expect that's most likely.
I'll note that every single 63bN appears to have the same crash (on the same CPU), so rebuilds likely won't help. And the function that calls into this PORT_Memcpy is unchanged since 2016 (though it could be 'value', coming from NSC_GenerateKeyPair, though that also looks unchanged). So my guess would be something about what the optimizer/PGO does to one of these two functions, on this CPU, and for some reason starting in 63 (did we change compilers, options?)
(In reply to Randell Jesup [:jesup] from comment #7) >(did we change compilers, options?) yes, we switched to clang-cl in firefox 63 (bug 1443590)
I don't think that there's anything we can do from the NSS side. As Randell said, the code path didn't change in a couple years. clang-cl could actually be the culprit here.
David, it seems this crash spike in 63 may be caused by our migration to clang, could you investigate please? Thanks (also CCed ajones for completeness)
Do we know what goes wrong with this CPU, or is it just unaccountably evil? Did we ever talk to Intel?
it's an amd cpu - some investigation (and outreach) was done in https://bugzilla.mozilla.org/show_bug.cgi?id=amdbug
i've asked an affected user at https://support.mozilla.org/en-US/questions/1238391 if they also crash when switching to the beta channel to determine if the issue is build specific. unfortunately they are also crashing with 64.0b, although with a windows error report and not our crash reporter kicking in
(In reply to [:philipp] from comment #13) If the user is willing, they could try an ASan build which will provide a symbolized stack trace on stderr without the dependence on the error reporters. Or they could try using an ASan Nightly Project build which uses a different reporting method.  https://index.taskcluster.net/v1/task/gecko.v2.mozilla-central.latest.firefox.win64-asan-opt/artifacts/public/build/target.zip  https://developer.mozilla.org/en-US/docs/Mozilla/Testing/ASan_Nightly_Project
I hope and ASAN build will trigger it; but if this *is* a CPU bug (and that's likely, very) then the ASAN code generation may very well be different enough not to trigger. fingers crossed, though. Crashes are up to 700/day as of Thursday.... :-( :-(
unfortunately the windows asan builds didn't provide proper logs from crashing when we tested it. perhaps we could change the stub installer so that we serve 32-bit versions to the affected systems (windows 7, cpu info "family 20 model 1 stepping 0 | 2", cpu microcode versions: "0x5000025", "0x5000026", "0x5000028").
We also should reach out to AMD again
What if we rolled our own memcpy for this one callsite? A naive implementation is likely to look very different from vcruntime's which I'm pretty sure tries to do all sorts of clever things with SIMD etc.
Using a different memcpy (or doing it inline) might well change the code enough to side-step the bug. The impl of the memcpy itself isn't likely the problem; it's the argument passed into it that likely was trashed.
I've got a few of these CPUs at the office. I'll boot one up tomorrow and try 63 on it. At least one of them has the bad silicon, but I've had trouble reproducing crashes. (I hit a JIT crash exactly once but it was the exact crazy behavior of CPU bug we were looking for).
:iain helped take a look with the AMD bobcat laptops and was not able to reproduce the crashes on 63 (including fully up to date windows).
Wontfix for 63 as the crashes went down on 63 and we are unlikely to have another dot release before 64 ships in 3 weeks from now.
Status: NEW → RESOLVED
Closed: 7 months ago
Resolution: --- → WORKSFORME
You need to log in before you can comment on or make changes to this bug.