Closed Bug 1501343 Opened 11 months ago Closed 7 months ago

Crash in vcruntime140.dll@0xcacf | sftk_NewAttribute

Categories

(NSS :: Libraries, defect, critical)

x86_64
Windows
defect
Not set
critical

Tracking

(firefox-esr60 unaffected, firefox63+ wontfix, firefox64 ?, firefox65 ?)

RESOLVED WORKSFORME
Tracking Status
firefox-esr60 --- unaffected
firefox63 + wontfix
firefox64 --- ?
firefox65 --- ?

People

(Reporter: philipp, Unassigned)

References

Details

(4 keywords)

Crash Data

This bug was filed from the Socorro interface and is
report bp-b181fd25-6aae-4fce-b8d1-feaf10181023.
=============================================================

Top 10 frames of crashing thread:

0 vcruntime140.dll vcruntime140.dll@0xcacf 
1 softokn3.dll struct SFTKAttributeStr* sftk_NewAttribute security/nss/lib/softoken/pkcs11u.c:66
2 softokn3.dll NSC_GenerateKeyPair security/nss/lib/softoken/pkcs11c.c:4788
3 nss3.dll PK11_GenerateKeyPairWithOpFlags security/nss/lib/pk11wrap/pk11akey.c:1530
4 nss3.dll SECKEY_CreateECPrivateKey security/nss/lib/cryptohi/seckey.c:219
5 nss3.dll ssl_CreateECDHEphemeralKeyPair security/nss/lib/ssl/ssl3ecc.c:451
6 nss3.dll ssl3_SendECDHClientKeyExchange security/nss/lib/ssl/ssl3ecc.c:192
7 nss3.dll static _SECStatus ssl3_SendClientSecondRound security/nss/lib/ssl/ssl3con.c:7606
8 nss3.dll ssl3_HandleHandshakeMessage security/nss/lib/ssl/ssl3con.c:11632
9 nss3.dll ssl3_HandleNonApplicationData security/nss/lib/ssl/ssl3con.c:12340

=============================================================

crash reports with this signature seem to be regressing in volume in firefox 63.
they're all coming from 64bit installations and amd "cpu family 20 model 1 stepping 0 | 2" - this particular cpu has a history of being problematic (bug 772330 & others...)
Duplicate of this bug: 1501643
If this is CPU related, is there a known workaround?
Flags: needinfo?(dmajor)
Historically the only things we've been able to do are
(1) Rebuild and hope that PGO nondeterminism makes the binary different enough to avoid triggering the crash, or
(2) Look for functions that repeatedly fall victim to this issue and refactor them.

In this case neither is likely to work because the crash is outside of our code.

Probably the best you could do is ask Microsoft to mine their own WER data and confirm whether they see a lot of hits in vcruntime140.dll@0xcacf in version 14.13.26020.0 amd64, and ask them to rework that function (memset?), assuming that's even the latest version of that DLL.

(PS - why is this bug hidden?)
Flags: needinfo?(dmajor)
Group: core-security → crypto-core-security
Assignee: nobody → nobody
Component: Security: PSM → Libraries
Product: Core → NSS
QA Contact: franziskuskiefer
Version: unspecified → other
> (PS - why is this bug hidden?)

All wildptr (apparently real pointer, but faulted) READ failures
That's our biggest new crasher on 63, can we get this one prioritized and assigned? I'd take a patch for this crash in a dot release if we have one.
Flags: needinfo?(franziskuskiefer)
Since this is an apparent issue in an underlying OS library on a specific CPU, it may be tough.  However, it's possible that what's happening here is that the CPU is messing up earlier (in our code), and the data only gets touched after it's passed to an OS routine.  This wouldn't be amazingly surprising - in fact, I would expect that's most likely.
I'll note that every single 63bN appears to have the same crash (on the same CPU), so rebuilds likely won't help.  And the function that calls into this PORT_Memcpy is unchanged since 2016 (though it could be 'value', coming from NSC_GenerateKeyPair, though that also looks unchanged).

So my guess would be something about what the optimizer/PGO does to one of these two functions, on this CPU, and for some reason starting in 63 (did we change compilers, options?)
(In reply to Randell Jesup [:jesup] from comment #7)
>(did we change compilers, options?)
yes, we switched to clang-cl in firefox 63 (bug 1443590)
I don't think that there's anything we can do from the NSS side. As Randell said, the code path didn't change in a couple years.

clang-cl could actually be the culprit here.
Flags: needinfo?(franziskuskiefer)
David, it seems this crash spike in 63 may be caused by our migration to clang, could you investigate please? Thanks (also CCed ajones for completeness)
Flags: needinfo?(dmajor)
Do we know what goes wrong with this CPU, or is it just unaccountably evil?  Did we ever talk to Intel?
it's an amd cpu - some investigation (and outreach) was done in https://bugzilla.mozilla.org/show_bug.cgi?id=amdbug
i've asked an affected user at https://support.mozilla.org/en-US/questions/1238391 if they also crash when switching to the beta channel to determine if the issue is build specific. unfortunately they are also crashing with 64.0b, although with a windows error report and not our crash reporter kicking in
(In reply to [:philipp] from comment #13)

If the user is willing, they could try an ASan build[1] which will provide a symbolized stack trace on stderr without the dependence on the error reporters. Or they could try using an ASan Nightly Project build[2] which uses a different reporting method.

[1] https://index.taskcluster.net/v1/task/gecko.v2.mozilla-central.latest.firefox.win64-asan-opt/artifacts/public/build/target.zip
[2] https://developer.mozilla.org/en-US/docs/Mozilla/Testing/ASan_Nightly_Project
I hope and ASAN build will trigger it; but if this *is* a CPU bug (and that's likely, very) then the ASAN code generation may very well be different enough not to trigger.  fingers crossed, though.

Crashes are up to 700/day as of Thursday.... :-( :-(
unfortunately the windows asan builds didn't provide proper logs from crashing when we tested it.

perhaps we could change the stub installer so that we serve 32-bit versions to the affected systems (windows 7, cpu info "family 20 model 1 stepping 0 | 2", cpu microcode versions: "0x5000025", "0x5000026", "0x5000028").
We also should reach out to AMD again
What if we rolled our own memcpy for this one callsite? A naive implementation is likely to look very different from vcruntime's which I'm pretty sure tries to do all sorts of clever things with SIMD etc.
Flags: needinfo?(dmajor)
Using a different memcpy (or doing it inline) might well change the code enough to side-step the bug.  The impl of the memcpy itself isn't likely the problem; it's the argument passed into it that likely was trashed.
I've got a few of these CPUs at the office. I'll boot one up tomorrow and try 63 on it. At least one of them has the bad silicon, but I've had trouble reproducing crashes. (I hit a JIT crash exactly once but it was the exact crazy behavior of CPU bug we were looking for).
:iain helped take a look with the AMD bobcat laptops and was not able to reproduce the crashes on 63 (including fully up to date windows).
Group: crypto-core-security
Wontfix for 63 as the crashes went down on 63 and we are unlikely to have another dot release before 64 ships in 3 weeks from now.

Hi, I am new here. Stumble upon this bug.

The socorro report mentioned about a raw dump. It would be helpful to be able to download this raw dump. Does anyone have a link to it.

From Raw Dump Description

{ "crash_info": { "address": "0x10c02dddf8b", "crashing_thread": 7, "type": "EXCEPTION_ACCESS_VIOLATION_READ" }, "crashing_thread": { "frames": [ { "frame": 0, "missing_symbols": true, "module": "VCRUNTIME140.dll", "module_offset": "0xcacf", "offset": "0x7fefb1ccacf", "registers": { "r10": "0x0000010c02dddf8b", "r11": "0x000000001fd6a15c", "r12": "0x000007fee6b51b00", "r13": "0x000007fee6b51b00", "r14": "0x0000010c02dddf8b", "r15": "0x0000000000000001", "r8": "0x0000000000000001", "r9": "0x000007fefb1c0000", "rax": "0x000000001fd6a15c", "rbp": "0x0000000000000001", "rbx": "0x00000000000000c0", "rcx": "0x000007fefb1ccacf", "rdi": "0x000000001fd6a000", "rdx": "0x0000010c02dddf8b", "rip": "0x000007fefb1ccacf", "rsi": "0x0000000000000103", "rsp": "0x0000000002dddca8" }, "trust": "context" }

The faulty address is 0x0000010c02dddf8b, the stack pointer is at 0x2dddca8. The calling function was doing a

PORT_Memory(attribute->attrib.pValue, value, len);

The exception is EXCEPTION_ACCESS_VIOLATION_READ, it appear to faulty address was in the variable "value". The variable "value" should point to somewhere in stack so it should be like 0x02dddXXX. The lower 32 bit of the fault was indeed 0x02dddf8b. But the most significant 32 bit has a value of 0x10c instead of 0. This would cause an address fault. The value of 0x10c happens to be the value of type CKA_DERIVE. It appear the value of "value" was picked up from the stack incorrectly. sftk_NewAttrbute did not put type value on stack. It is put on the stack by NSC_GenerateKeyPair() and picked up by sftk_NewAttrbute. Checked all the type declarations, they seem OK. Need to examine the raw dump to see how the compiler mistakenly placed the wrong values in stack.

One thing I noticed was the calling sequence from source is
NSC_GenerateKeyPair -> sftk_AddAttributeType -> sftk_NewAttribute

But the raw dump only showed NSC_GenerateKeyPair -> sftk_NewAttribute. The compiler optimizer compacted the call to sftk_AddAttribute, this may have caused sftk_NewAttribute to pick up the parameters from the wrong place in stack.

Since we're not seeing this crash anymore, and there's nothing obviously wrong with the code in the frame, marking worksforme.

Status: NEW → RESOLVED
Closed: 7 months ago
Resolution: --- → WORKSFORME
You need to log in before you can comment on or make changes to this bug.