1501343 - Crash in vcruntime140.dll@0xcacf | sftk_NewAttribute

Reporter

Description

•

7 years ago

This bug was filed from the Socorro interface and is report bp-b181fd25-6aae-4fce-b8d1-feaf10181023. ============================================================= Top 10 frames of crashing thread: 0 vcruntime140.dll vcruntime140.dll@0xcacf 1 softokn3.dll struct SFTKAttributeStr* sftk_NewAttribute security/nss/lib/softoken/pkcs11u.c:66 2 softokn3.dll NSC_GenerateKeyPair security/nss/lib/softoken/pkcs11c.c:4788 3 nss3.dll PK11_GenerateKeyPairWithOpFlags security/nss/lib/pk11wrap/pk11akey.c:1530 4 nss3.dll SECKEY_CreateECPrivateKey security/nss/lib/cryptohi/seckey.c:219 5 nss3.dll ssl_CreateECDHEphemeralKeyPair security/nss/lib/ssl/ssl3ecc.c:451 6 nss3.dll ssl3_SendECDHClientKeyExchange security/nss/lib/ssl/ssl3ecc.c:192 7 nss3.dll static _SECStatus ssl3_SendClientSecondRound security/nss/lib/ssl/ssl3con.c:7606 8 nss3.dll ssl3_HandleHandshakeMessage security/nss/lib/ssl/ssl3con.c:11632 9 nss3.dll ssl3_HandleNonApplicationData security/nss/lib/ssl/ssl3con.c:12340 ============================================================= crash reports with this signature seem to be regressing in volume in firefox 63. they're all coming from 64bit installations and amd "cpu family 20 model 1 stepping 0 | 2" - this particular cpu has a history of being problematic (bug 772330 & others...)

Pascal Chevrel:pascalc

Updated

•

7 years ago

tracking-firefox63: --- → +

Randell Jesup [:jesup] (needinfo me)

Comment 2

•

7 years ago

If this is CPU related, is there a known workaround?

Flags: needinfo?(dmajor)

(Away)

Comment 3

•

7 years ago

Historically the only things we've been able to do are (1) Rebuild and hope that PGO nondeterminism makes the binary different enough to avoid triggering the crash, or (2) Look for functions that repeatedly fall victim to this issue and refactor them. In this case neither is likely to work because the crash is outside of our code. Probably the best you could do is ask Microsoft to mine their own WER data and confirm whether they see a lot of hits in vcruntime140.dll@0xcacf in version 14.13.26020.0 amd64, and ask them to rework that function (memset?), assuming that's even the latest version of that DLL. (PS - why is this bug hidden?)

Flags: needinfo?(dmajor)

Tyson Smith [:tsmith]

Updated

•

7 years ago

Group: core-security → crypto-core-security

Dana Keeler (she/her) [:keeler]

Updated

•

7 years ago

Assignee: nobody → nobody

Component: Security: PSM → Libraries

Product: Core → NSS

QA Contact: franziskuskiefer

Version: unspecified → other

Randell Jesup [:jesup] (needinfo me)

Comment 4

•

7 years ago

> (PS - why is this bug hidden?) All wildptr (apparently real pointer, but faulted) READ failures

Pascal Chevrel:pascalc

Comment 5

•

7 years ago

That's our biggest new crasher on 63, can we get this one prioritized and assigned? I'd take a patch for this crash in a dot release if we have one.

Flags: needinfo?(franziskuskiefer)

Randell Jesup [:jesup] (needinfo me)

Comment 6

•

7 years ago

Since this is an apparent issue in an underlying OS library on a specific CPU, it may be tough. However, it's possible that what's happening here is that the CPU is messing up earlier (in our code), and the data only gets touched after it's passed to an OS routine. This wouldn't be amazingly surprising - in fact, I would expect that's most likely.

Randell Jesup [:jesup] (needinfo me)

Comment 7

•

7 years ago

I'll note that every single 63bN appears to have the same crash (on the same CPU), so rebuilds likely won't help. And the function that calls into this PORT_Memcpy is unchanged since 2016 (though it could be 'value', coming from NSC_GenerateKeyPair, though that also looks unchanged). So my guess would be something about what the optimizer/PGO does to one of these two functions, on this CPU, and for some reason starting in 63 (did we change compilers, options?)

[:philipp]

Reporter

Comment 8

•

7 years ago

(In reply to Randell Jesup [:jesup] from comment #7) >(did we change compilers, options?) yes, we switched to clang-cl in firefox 63 (bug 1443590)

Franziskus Kiefer [:franziskus]

Comment 9

•

7 years ago

I don't think that there's anything we can do from the NSS side. As Randell said, the code path didn't change in a couple years. clang-cl could actually be the culprit here.

Flags: needinfo?(franziskuskiefer)

Pascal Chevrel:pascalc

Comment 10

•

7 years ago

David, it seems this crash spike in 63 may be caused by our migration to clang, could you investigate please? Thanks (also CCed ajones for completeness)

Flags: needinfo?(dmajor)

Randell Jesup [:jesup] (needinfo me)

Comment 11

•

7 years ago

Do we know what goes wrong with this CPU, or is it just unaccountably evil? Did we ever talk to Intel?

[:philipp]

Reporter

Comment 12

•

7 years ago

it's an amd cpu - some investigation (and outreach) was done in https://bugzilla.mozilla.org/show_bug.cgi?id=amdbug

[:philipp]

Reporter

Comment 13

•

7 years ago

i've asked an affected user at https://support.mozilla.org/en-US/questions/1238391 if they also crash when switching to the beta channel to determine if the issue is build specific. unfortunately they are also crashing with 64.0b, although with a windows error report and not our crash reporter kicking in

Tyson Smith [:tsmith]

Comment 14

•

7 years ago

(In reply to [:philipp] from comment #13) If the user is willing, they could try an ASan build[1] which will provide a symbolized stack trace on stderr without the dependence on the error reporters. Or they could try using an ASan Nightly Project build[2] which uses a different reporting method. [1] https://index.taskcluster.net/v1/task/gecko.v2.mozilla-central.latest.firefox.win64-asan-opt/artifacts/public/build/target.zip [2] https://developer.mozilla.org/en-US/docs/Mozilla/Testing/ASan_Nightly_Project

Randell Jesup [:jesup] (needinfo me)

Comment 15

•

7 years ago

I hope and ASAN build will trigger it; but if this *is* a CPU bug (and that's likely, very) then the ASAN code generation may very well be different enough not to trigger. fingers crossed, though. Crashes are up to 700/day as of Thursday.... :-( :-(

[:philipp]

Reporter

Comment 16

•

7 years ago

unfortunately the windows asan builds didn't provide proper logs from crashing when we tested it. perhaps we could change the stub installer so that we serve 32-bit versions to the affected systems (windows 7, cpu info "family 20 model 1 stepping 0 | 2", cpu microcode versions: "0x5000025", "0x5000026", "0x5000028").

Randell Jesup [:jesup] (needinfo me)

Comment 17

•

7 years ago

We also should reach out to AMD again

(Away)

Comment 18

•

7 years ago

What if we rolled our own memcpy for this one callsite? A naive implementation is likely to look very different from vcruntime's which I'm pretty sure tries to do all sorts of clever things with SIMD etc.

Flags: needinfo?(dmajor)

Randell Jesup [:jesup] (needinfo me)

Comment 19

•

7 years ago

Using a different memcpy (or doing it inline) might well change the code enough to side-step the bug. The impl of the memcpy itself isn't likely the problem; it's the argument passed into it that likely was trashed.

Andrew McCreight [:mccr8]

Updated

•

6 years ago

Keywords: csectype-other

Ted Campbell [:tcampbell]

Comment 20

•

6 years ago

I've got a few of these CPUs at the office. I'll boot one up tomorrow and try 63 on it. At least one of them has the bad silicon, but I've had trouble reproducing crashes. (I hit a JIT crash exactly once but it was the exact crazy behavior of CPU bug we were looking for).

Ted Campbell [:tcampbell]

Comment 21

•

6 years ago

:iain helped take a look with the AMD bobcat laptops and was not able to reproduce the crashes on 63 (including fully up to date windows).

Daniel Veditz [:dveditz] out until Nov 10

Updated

•

6 years ago

Group: crypto-core-security

Daniel Veditz [:dveditz] out until Nov 10

Updated

•

6 years ago

Keywords: csectype-other → sec-other

Pascal Chevrel:pascalc

Comment 22

•

6 years ago

Wontfix for 63 as the crashes went down on 63 and we are unlikely to have another dot release before 64 ships in 3 weeks from now.

status-firefox63: affected → wontfix

Pun Lam

Comment 23

•

6 years ago

Hi, I am new here. Stumble upon this bug.

The socorro report mentioned about a raw dump. It would be helpful to be able to download this raw dump. Does anyone have a link to it.

From Raw Dump Description

{ "crash_info": { "address": "0x10c02dddf8b", "crashing_thread": 7, "type": "EXCEPTION_ACCESS_VIOLATION_READ" }, "crashing_thread": { "frames": [ { "frame": 0, "missing_symbols": true, "module": "VCRUNTIME140.dll", "module_offset": "0xcacf", "offset": "0x7fefb1ccacf", "registers": { "r10": "0x0000010c02dddf8b", "r11": "0x000000001fd6a15c", "r12": "0x000007fee6b51b00", "r13": "0x000007fee6b51b00", "r14": "0x0000010c02dddf8b", "r15": "0x0000000000000001", "r8": "0x0000000000000001", "r9": "0x000007fefb1c0000", "rax": "0x000000001fd6a15c", "rbp": "0x0000000000000001", "rbx": "0x00000000000000c0", "rcx": "0x000007fefb1ccacf", "rdi": "0x000000001fd6a000", "rdx": "0x0000010c02dddf8b", "rip": "0x000007fefb1ccacf", "rsi": "0x0000000000000103", "rsp": "0x0000000002dddca8" }, "trust": "context" }

The faulty address is 0x0000010c02dddf8b, the stack pointer is at 0x2dddca8. The calling function was doing a

PORT_Memory(attribute->attrib.pValue, value, len);

The exception is EXCEPTION_ACCESS_VIOLATION_READ, it appear to faulty address was in the variable "value". The variable "value" should point to somewhere in stack so it should be like 0x02dddXXX. The lower 32 bit of the fault was indeed 0x02dddf8b. But the most significant 32 bit has a value of 0x10c instead of 0. This would cause an address fault. The value of 0x10c happens to be the value of type CKA_DERIVE. It appear the value of "value" was picked up from the stack incorrectly. sftk_NewAttrbute did not put type value on stack. It is put on the stack by NSC_GenerateKeyPair() and picked up by sftk_NewAttrbute. Checked all the type declarations, they seem OK. Need to examine the raw dump to see how the compiler mistakenly placed the wrong values in stack.

One thing I noticed was the calling sequence from source is
NSC_GenerateKeyPair -> sftk_AddAttributeType -> sftk_NewAttribute

But the raw dump only showed NSC_GenerateKeyPair -> sftk_NewAttribute. The compiler optimizer compacted the call to sftk_AddAttribute, this may have caused sftk_NewAttribute to pick up the parameters from the wrong place in stack.

J.C. Jones [:jcj] (he/they)

Comment 24

•

6 years ago

Since we're not seeing this crash anymore, and there's nothing obviously wrong with the code in the frame, marking worksforme.

Status: NEW → RESOLVED

Closed: 6 years ago

Resolution: --- → WORKSFORME