Closed Bug 618068 Opened 14 years ago Closed 14 years ago

Cipher finalize failed on startup

Categories

(Firefox :: Sync, defect)

x86
macOS
defect
Not set
normal

Tracking

()

VERIFIED FIXED
Tracking Status
blocking2.0 --- beta8+
fennec 2.0b3+ ---

People

(Reporter: rnewman, Assigned: philikon)

References

Details

Attachments

(2 files)

See also Bug 617650.

Log attached.
Info:

* M-C build: 

parent: 59027:799ba3224669 tip
 Bug 617877: Web console code completion suggestion sticks around when it shouldn't, r+a=gavin
branch: default
commit: (clean)
update: 1 new changesets, 2 branch heads (merge)
mq:     1 unapplied


* fx-sync addon build:

parent: 3177:e4b76e1f74cc tip
 Bug 617868 follow-up: fix test
branch: default
commit: 1 modified, 1 unknown
update: (current)


I.e., up-to-date as of about 1pm today.


M-C includes:

changeset:   58968:0f28e112a6b6
user:        Kai Engert <kaie@kuix.de>
date:        Thu Dec 09 12:04:11 2010 +0100
summary:     Bug 599324, land NSPR_4_8_7_BETA1 and NSS_3_12_9_BETA2, a=blocking-2.0-beta8+


`make xpcshell-tests` passes all 89 tests in fx-sync using that M-C build.
To sum up Richard's findings:

* unit tests are working fine, syncing (or logging in) does not.
* it's ok with yesterday's nightlies which didn't have the NSS landing in them
* it breaks with a build from today or today's nightlies

Likely culprit: NSS
Marking as blocking beta 8 for now.
blocking2.0: --- → beta8+
(In reply to comment #0)
> See also Bug 617650.

Richard, can you clarify?
(In reply to comment #4)

> Richard, can you clarify?

Sorry, missed this append.

Clarified on IRC: that bug captured my observation of a similar issue... but it didn't occur reproducibly on startup, and didn't occur with the changes I was making at the time. This seems much more pernicious.
tracking-fennec: --- → 2.0b3+
This is notoriously hard to reproduce, but I've just managed it:

Using a regular nightly (with edited WeaveCrypto.js to produce logging output):

1. Change My Sync Key
2. Run into bug 618179 which is only consolable by restarting the browser
3. Trying to outsmart Sync by just Disconnecting and then Connecting again.

Debugging output revealed this:

PK11_AlgtagToMechanism called... PORT_GetError() return value: -5987
PK11_GetPadMechanism called... PORT_GetError() return value: -5987
PK11_ParamFromIV called... PORT_GetError() return value: -5987
PK11_GetInternalKeySlot called... PORT_GetError() return value: -5987
PK11_ImportSymKey called... PORT_GetError() return value: -5987
PK11_CreateContextBySymKey called... PORT_GetError() return value: -5987
PK11_CipherOp return value: 0 PORT_GetError() return value: -5987
PK11_DigestFinal return value: -1 PORT_GetError() return value: -8023
WeaveCrypto: _commonCrypt: failed: [Exception... "cipher finalize failed"  nsresult: "0x80004005 (NS_ERROR_FAILURE)"  location: "JS frame :: resource://services-crypto/WeaveCrypto.js :: <TOP_LEVEL> :: line 580"  data: no]

The -5987 error code occurs a lot during successful (sic!) crypto operations, so it might be a red herring. -8023 is the actual error.
(In reply to comment #6)
> -8023 is the actual error.

According to secerr.h:

#define SEC_ERROR_BASE				(-0x2000)
...
SEC_ERROR_PKCS11_DEVICE_ERROR              =	(SEC_ERROR_BASE + 169),

-0x2000 + 169 = -8023

I honestly have no idea what that error means.
Found one way to reproduce this.

Repro:
1) install minefield nightly: Mozilla/5.0 (Macintosh; Intel Mac OS X 10.6; rv:2.0b8pre) Gecko/20101209 Firefox/4.0b8pre
2) sync your account successfully.
3) open prefs, and disconnect your account.  then connect again.  Check sync log that everything is ok.
4) Open prefs > Generate and Change sync key
5) Check sync log and verify error from bug 618179.  skip this and notice the notification pops up saying sync error, with a Sync Now button
6) Click that sync button:

2010-12-09 17:39:29	Service.Main         INFO	In sync().
2010-12-09 17:39:29	Net.Resource         DEBUG	GET success 200 https://phx-sync468.services.mozilla.com/1.0/msvdqjdrj2rglmx26it7rmuoqg3wzjzv/info/collections
2010-12-09 17:39:29	Service.Main         DEBUG	Fetching global metadata record
2010-12-09 17:39:29	Service.Main         DEBUG	Weave Version: @weave_version@ Local Storage: 4 Remote Storage: 4
2010-12-09 17:39:29	Service.Main         INFO	Sync key is up-to-date: no need to upgrade.
2010-12-09 17:39:29	Service.Main         DEBUG	Fetching and verifying -- or generating -- symmetric keys.
2010-12-09 17:39:29	Service.Main         INFO	Testing info/collections: {"meta":1291945161.03,"crypto":1291945161.98,"clients":1291945108.33,"bookmarks":1291945108.44,"forms":1291944939.98,"history":1291944940.08,"prefs":1291945108.86,"tabs":1291944940.35}
2010-12-09 17:39:29	CollectionKeys       INFO	Testing for updateNeeded. Last modified: 1291945161.76
2010-12-09 17:39:29	Service.Main         INFO	CollectionKeys reports that a key update is needed.
2010-12-09 17:39:30	Net.Resource         DEBUG	GET success 200 https://phx-sync468.services.mozilla.com/1.0/msvdqjdrj2rglmx26it7rmuoqg3wzjzv/storage/crypto/keys
2010-12-09 17:39:30	CollectionKeys       INFO	Updating collection keys...
2010-12-09 17:39:30	CollectionKeys       WARN	Got exception "[Exception... "cipher finalize failed"  nsresult: "0x80004005 (NS_ERROR_FAILURE)"  location: "JS frame :: resource://services-crypto/WeaveCrypto.js :: <TOP_LEVEL> :: line 557"  data: no]" decrypting storage keys with sync key.
2010-12-09 17:39:30	CollectionKeys       INFO	Aborting updateContents. Rethrowing.
2010-12-09 17:39:30	Service.Main         WARN	Got exception "[Exception... "cipher finalize failed"  nsresult: "0x80004005 (NS_ERROR_FAILURE)"  location: "JS frame :: resource://services-crypto/WeaveCrypto.js :: <TOP_LEVEL> :: line 557"  data: no]" fetching cryptoKeys.
2010-12-09 17:39:30	Service.Main         WARN	Failed to fetch symmetric keys. Failing remote setup.
2010-12-09 17:39:30	Service.Main         CONFIG	Starting backoff, next sync at:Thu Dec 09 2010 19:05:51 GMT-0800 (PST)
2010-12-09 17:39:30	Service.Main         DEBUG	Exception: aborting sync, remote setup failed No traceback available
[17:42] bsmith: In pk11err.c, there is PK11_MapError
[17:42] bsmith: and we can see it maps CKR_DEVICE_ERROR to SEC_ERROR_PKCS11_DEVICE_ERROR
[17:43] bsmith: So, in softoken I will look for why it can return CKR_DEVICE_ERROR from NSC_DecyptFinal
[17:43] bsmith: This is decryption, right?
[17:43] philiKON: yes
[17:44] philiKON: note that it looks like Sync might be trigger this after all
[17:44] philiKON: our STR include hitting a bug and it's not entirely clear to me whether Sync then behaves sanely after hitting that bug
[17:45] bsmith: OK.
[17:45] philiKON: rnewman is looking into it right now
[17:45] bsmith: Unforunately, CKR_DEVICE_ERROR is the default error in softoken
[17:45] philiKON: gah
[17:46] bsmith: in pkcs11c.c sftk_MapCryptError maps error codes from freebl to CKR_* codes.
[17:46] bsmith: and the default is CKR_DEVICE_ERROR for "backward compatibility" ?!?
There's something wacky going on in sync (bug 618179) which is the real blocker here.  We should still dig deeper here, but this might be bad input from sync.  Still should bulletproof this, but not a blocker for b8 (bug 618179 replaces this one)
blocking2.0: beta8+ → ---
tracking-fennec: 2.0b3+ → ---
Philipp, do you still observe this with my last two fixes? (Useful info to have in this bug.)
(In reply to comment #11)
> Philipp, do you still observe this with my last two fixes? (Useful info to have
> in this bug.)

I can no longer reproduce this as described in comment 6, since bug 618179 was fixed. I do believe there's an underlying issue here, but we definitely have seemed to remove the biggest cause.
What's the state for users of nightly builds? I see this error using the 20101210 nightly, which should actually include the fix for bug 618179...
I too just ran into this again with a fresh m-c build this morning

1. Created a new account, sync
2. Stop Using This Account
3. Connect to the account again, sync
I hit that this morning also on 12/10 trunk.  but this time i did NOT connect/disconnect nor change Sync key like comment 8.   I was simply trying to add a device, and allowing my desktop sync to auto-refresh.

log attached.
Attached file cipher error log
Renominating this for a blocker.
blocking2.0: --- → ?
My suspicion is that one of the inputs to the failing function is bad (e.g. null pointer or buffer too small). I am going to work on getting more helpful diagnostics in NSS and in WeaveCrypto to help pinpoint what is wrong (if anything). Also, I am investigating what might have changed (if anything) in these functions between NSS 3.12.8 and 3.12.9.
Got this bug on Seamonkey trunk, win XP 32bit.
Immediately after update sync addon from 1.6.b2 to 1.6.b3 got:
2010-12-10 20:12:53	Service.Main         INFO	In sync().
2010-12-10 20:12:53	Net.Resource         DEBUG	GET success 200 https://myhost/user/info/collections
2010-12-10 20:12:53	Service.Main         DEBUG	Fetching global metadata record
2010-12-10 20:12:53	Service.Main         DEBUG	Weave Version: 1.6b3 Local Storage: 5 Remote Storage: 5
2010-12-10 20:12:53	Service.Main         INFO	Sync key is up-to-date: no need to upgrade.
2010-12-10 20:12:53	Service.Main         DEBUG	Fetching and verifying -- or generating -- symmetric keys.
2010-12-10 20:12:53	Service.Main         INFO	Testing info/collections: {"crypto":1292002340.01,"history":1292002353.31,"meta":1292002338.83,"tabs":1292002353.49}
2010-12-10 20:12:53	CollectionKeys       INFO	Testing for updateNeeded. Last modified: 1292002339.61
2010-12-10 20:12:53	Service.Main         INFO	CollectionKeys reports that a key update is needed.
2010-12-10 20:12:53	Net.Resource         DEBUG	GET success 200 https://myhost/user/storage/crypto/keys
2010-12-10 20:12:53	CollectionKeys       INFO	Updating collection keys...
2010-12-10 20:12:53	CollectionKeys       WARN	Got exception "[Exception... "cipher finalize failed"  nsresult: "0x80004005 (NS_ERROR_FAILURE)"  location: "JS frame :: resource://services-crypto/WeaveCrypto.js :: <TOP_LEVEL> :: line 557"  data: no]" decrypting storage keys with sync key.
2010-12-10 20:12:53	CollectionKeys       INFO	Aborting updateContents. Rethrowing.
2010-12-10 20:12:53	Service.Main         WARN	Got exception "[Exception... "cipher finalize failed"  nsresult: "0x80004005 (NS_ERROR_FAILURE)"  location: "JS frame :: resource://services-crypto/WeaveCrypto.js :: <TOP_LEVEL> :: line 557"  data: no]" fetching cryptoKeys.
2010-12-10 20:12:53	Service.Main         WARN	Failed to fetch symmetric keys. Failing remote setup.
2010-12-10 20:12:53	Service.Main         DEBUG	Exception: aborting sync, remote setup failed No traceback available
OS: Mac OS X → All
I reviewed the changes security/nss in the patch that upgraded NSS to 3.12.9. Literally nothing change in the encryption/decryption code and the other changes seem extremely unlikely to cause changes to AES encryption/decryption. If that landing caused any problems it would most likely be something in NSPR but that seems also very unlikely.
OS: All → Mac OS X
Wait, I take back comment 20. I do see something that changed in the AES code. I will investigate it further.
blocking2.0: ? → beta8+
tracking-fennec: --- → 2.0b3+
It looks like the AES changes in NSS 3.12.9 only affect Linux x64 and Solaris x64.
We are using PKCS#7 padding. That means the plaintext must end in one of these values:

0x01
0x02 0x02
0x03 0x03 0x03
...
0x0F 0x0F .... 0x0F 0x0F

If the data does not end in one of these byte strings then the padding is invalid. The padding could be invalid because (a) we are using a different key than was used to encrypt the data, or (b) the padding is actually incorrect. When the padding is invalid then Softoken *should* return something like CKR_ENCRYPTED_DATA_INVALID. However, the error reporting logic in NSC_DecryptFinal is broken for the specific case of invalid padding and it could return an unpredictable value but most likely the default, CKR_DEVICE_ERROR. I wrote a patch to fix the error reporting from NSS and I suspect that NSS will start returning CKR_ENCRYPTED_DATA_INVALID when we hit this bug with that patch. If so, that will verify that my thinking is correct and that the bug is likely somewhere else. This NSS patch will be useful for diagnosing the problem but it won't be necessary to include this patch in mozilla-central's NSS.
tracking-fennec: 2.0b3+ → ---
I removed the blocking-fennec flag by accident.
tracking-fennec: --- → ?
tracking-fennec: ? → 2.0b3+
I tried to replicate this using the test_crypto_crypt script, but only got this far:

http://rnewman.pastebin.mozilla.org/883941

:(
Also, we can verify whether changes in NSS are to blame by simply building trunk and then replacing the NSS DLLs (softokn3, freebl3, nss3, nssutil3) with the DLLs from an earlier nightly. J-PAKE won't work in this configuration but everything else should work exactly the same.

philiKON said that we can reproduce the problem reliably by backing out https://bug618179.bugzilla.mozilla.org/attachment.cgi?id=496715 but I don't know the STR after reversing that patch. Please post the STR to this bug.
I reproduced this by decrypting in a tight loop.


15:36:31 <@rnewman> TEST-UNEXPECTED-FAIL | (xpcshell/head.js) | [Exception... "cipher finalize failed"  nsresult: "0x80004005 (NS_ERROR_FAILURE)"  location: "JS frame :: resource://services-crypto/WeaveCrypto.js :: <TOP_LEVEL> :: line 557"  data: no]
15:36:57 <@rnewman> http://rnewman.pastebin.mozilla.org/884026
15:37:03 <@rnewman> then run test_crypto_crypt
15:37:30 <@rnewman> that is, simply decrypting the same thing in a tight loop will cause this failure
15:38:00 <@rnewman> in this case, after 1058 iterations succeeded
Richard, were you able to reproduce this before or only after the memoization patch?
(In reply to comment #29)
> Richard, were you able to reproduce this before or only after the memoization
> patch?

Before. Comment 28 is simply:

* Disable the memoization I added
* Run a decrypt 6,000 times.

It dies reliably for me between 1,000 and 5,000 iterations.

The decryption is just:

function multiple_decrypts(iterations) {
  let iv = cryptoSvc.generateRandomIV();
  let key = cryptoSvc.generateRandomKey();
  let cipherText = cryptoSvc.encrypt("Hello, world.", key, iv);
  
  for (let i = 0; i < iterations; ++i) {
    let clearText = cryptoSvc.decrypt(cipherText, key, iv);
    do_check_eq(clearText + " " + i, "Hello, world. " + i);
  }
}

Note that the inputs to decrypt are not NSS objects, so this could happen in the real world. I don't know why it doesn't.

The memoization I added affects the internals of decrypt (_commonCrypt), and results in this running for 10,000 iterations without the error.
Brian and I just spent a good while reproducing this. We got yesterday's nightly which doesn't have the bugfix for bug 618179 and added debugging output to WeaveCrypto encrypt() and decrypt(). Thanks to bug 618179 we were then able to reliably reproduce the error using the STR as originally described in comment 6:

1. Go to My Sync Key, generate a new one and confirm.
2. Run into bug 618179 which is only consolable by restarting the browser
3. Trying to outsmart Sync by just Disconnecting and then Connecting again.
4. Sync. It fails. Observe cipher finalize failed error.

Changing the Sync Key called WeaveCrypto.encrypt() on the new keys WBO. The Disconnect + Connect step would then call WeaveCrypto.decrypt() on that same data. We verified that whatever WeaveCrypto.encrypt() returned on the first step was passed in to decrypt(). The problem must have therefore been inside WeaveCrypto.

"cipher finalize failed" is thrown when the padding is off. This can happen when the key used is wrong. To look at what the padding would be, we changed WeaveCrypto.decrypt() to retry decryption with the unpadded mechanism if "cipher finalized failed" was encountered. This revealed indeed that the padding was not right. So despite being passed in the right values as parameters, decrypt() ended up using the wrong key.

This was a strong indicator that Richard's SECItem memoization patch from bug 610914 may have been responsible. We disabled the memoization and redid the above steps four times, but couldn't reproduce the error anymore. Given the following evidence, it's pretty clear to me that we need to back out the memoization:

* the parameter values to decrypt() were correct but the cipher operation was using the wrong key
* the random occurrences of the bug outside the reliable STR
* the fact it was reliably reproducible my changing the Sync Key (lack of cache invalidation?)
* but most importantly: the fact that the error could no longer be reproduced after memoization was disabled.

tl;dr: I'm going to back out bug 610914.
(In reply to comment #31)

> tl;dr: I'm going to back out bug 610914.

Good sleuthing!
Backed out Bug 610914: https://hg.mozilla.org/services/fx-sync/rev/6c7c1a5d5967
Status: NEW → RESOLVED
Closed: 14 years ago
Resolution: --- → FIXED
Blocks: 618492
(In reply to comment #30)
> (In reply to comment #29)
> * Disable the memoization I added
> * Run a decrypt 6,000 times.
> 
> It dies reliably for me between 1,000 and 5,000 iterations.
>
> The memoization I added affects the internals of decrypt (_commonCrypt), and
> results in this running for 10,000 iterations without the error.

Unfortunately, a failure of Pl11_DigestFinal just means that something bad happened that caused its inputs to be bad. I filed a bug to improve the error reporting in the case where the padding is bad (bug 618418). 

A possible explanation of this tight-loop behavior is that, without memoization, we run out of memory somewhere and cause a bad (NULL) input to C_DecryptFinal or we create a bad key object because we run out of memory. I will investigate this separately. This would explain why the memoization allowed it to go for more iterations--the memoized version used less memory and there was probably no GC happening during the execution of the loop which means memory never got freed and thus addresses never got reused during the loop. This again points to the problem being the memoization's use memory addresses as part of the memoization key.
Status: RESOLVED → UNCONFIRMED
Ever confirmed: false
Resolution: FIXED → ---
Sorry, I didn't intend to change the resolution.
Status: UNCONFIRMED → RESOLVED
Closed: 14 years ago14 years ago
Resolution: --- → FIXED
This doesnt happen anymore on the nightlies that i see.   Marking verified on Mozilla/5.0 (Macintosh; Intel Mac OS X 10.6; rv:2.0b13pre) Gecko/20110317 Firefox/4.0b13pre
Status: RESOLVED → VERIFIED
Component: Firefox Sync: Crypto → Sync
Product: Cloud Services → Firefox
You need to log in before you can comment on or make changes to this bug.

Attachment

General

Created:
Updated:
Size: