Closed Bug 564011 Opened 15 years ago Closed 15 years ago

SQLITE3 backend has object ID race on keygen (even in single thread).

Categories

(NSS :: Libraries, defect)

3.12.6
defect
Not set
critical

Tracking

(Not tracked)

VERIFIED FIXED
3.12.7

People

(Reporter: sysadmin, Assigned: rrelyea)

References

(Blocks 1 open bug)

Details

(Whiteboard: FIPS)

Attachments

(5 files, 3 obsolete files)

User-Agent: Mozilla/5.0 (Windows; U; Windows NT 6.1; it; rv:1.9.1.9) Gecko/20100317 SeaMonkey/2.0.4 Build Identifier: 3.12.6 Calling C_GenerateKeyPair multiple times (and also with different key types!) returns the *same* object handle! Half of the generated keys are lost/overwritten. Calling the function simultaneously by multiple threads creates a real disaster. All works perfectly with the old BDB backend. Reproducible: Always Steps to Reproduce: Run the attached test program, that generates 30 RSA keys and 30 EC keys. Actual Results: With SQLITE3 backend: %./a.out /usr/local/lib/nss/libsoftokn3.so configdir='sql:.' 12345678 RSA -1795042673 -90903455 RSA -1164645278 -90903455 RSA -1164645278 -90903453 RSA -1164645276 -90903453 RSA -1164645276 -90903451 RSA -1164645274 -90903451 RSA -1164645274 -90903449 RSA -1164645272 -90903449 RSA -1164645272 -90903447 ECC -1164645270 -90903447 RSA -1164645268 -90903445 RSA -1164645268 -90903443 RSA -1164645266 -90903443 RSA -1164645266 -90903441 RSA -1164645264 -90903441 RSA -1164645264 -90903439 RSA -1164645262 -90903439 ECC C_GenerateKeyPair 48 ECC -1164645262 -90903437 ECC -1164645260 -90903437 ECC -1164645260 -90903435 ECC -1164645258 -90903435 RSA -1164645258 -90903433 RSA -1164645256 -90903433 RSA -1164645256 -90903431 RSA -1164645254 -90903431 RSA -1164645254 -90903429 ECC -1164645252 -90903429 RSA -1164645252 -90903427 ECC -1164645250 -90903427 ECC -1164645250 -90903425 ECC -1164645248 -90903425 ECC -1164645248 -90903423 ECC -1164645246 -90903423 ECC -1164645246 -90903421 ECC -1164645244 -90903421 ECC -1164645244 -90903419 ECC -1164645242 -90903419 ECC -1164645242 -90903417 ECC -1164645240 -90903417 ECC -1164645240 -90903415 ECC -1164645238 -90903415 ECC -1164645238 -90903413 ECC -1164645236 -90903413 ECC -1164645236 -90903411 ECC -1164645234 -90903411 ECC -1164645234 -90903409 ECC -1164645232 -90903409 ECC -1164645232 -90903407 ECC -1164645230 -90903407 ECC -1164645230 -90903405 ECC -1164645228 -90903405 RSA -1164645228 -90903403 RSA -1164645226 -90903403 RSA -1164645226 -90903401 RSA -1164645224 -90903401 RSA -1164645224 -90903399 RSA -1164645222 -90903399 RSA -1164645222 -90903397 RSA -1164645220 -90903397 Expected Results: With BDB backend: %./a.out /usr/local/lib/nss/libsoftokn3.so configdir='.' 12345678 RSA -1766937837 -827413741 RSA -1832945932 -893421836 RSA -1862048249 -922524153 RSA -1852166566 -912642470 RSA -1771687628 -832163532 RSA -1859749385 -920225289 RSA -1797889382 -858365286 ECC -1816936244 -877412148 ECC -1875615269 -936091173 RSA -1836008136 -896484040 ECC -1778997149 -839473053 ECC -1771380607 -831856511 RSA -1874015336 -934491240 ECC -1852978087 -913453991 ECC -1814593665 -875069569 ECC -1840635597 -901111501 RSA -1763881890 -824357794 ECC -1841696958 -902172862 ECC -1818447760 -878923664 RSA -1780874406 -841350310 ECC -1811901401 -872377305 RSA -1814322973 -874798877 ECC -1779343239 -839819143 ECC -1867467544 -927943448 RSA -1759277805 -819753709 ECC -1877920776 -938396680 ECC -1744845065 -805320969 ECC -1867408352 -927884256 RSA -1847740256 -908216160 ECC -1876261781 -936737685 ECC -1760721411 -821197315 ECC -1821192545 -881668449 ECC -1854008737 -914484641 ECC -1872495743 -932971647 RSA -1862994671 -923470575 ECC -1846173488 -906649392 ECC -1842191382 -902667286 RSA -1745104177 -805580081 ECC -1804836613 -865312517 ECC -1773628712 -834104616 ECC -1769863224 -830339128 RSA -1838952310 -899428214 ECC -1768950977 -829426881 ECC -1775645016 -836120920 RSA -1808191676 -868667580 ECC -1840679408 -901155312 ECC -1750369936 -810845840 ECC -1874893127 -935369031 RSA -1760258976 -820734880 RSA -1795309529 -855785433 RSA -1826079071 -886554975 RSA -1847844862 -908320766 RSA -1761326732 -821802636 RSA -1792705593 -853181497 RSA -1810165621 -870641525 RSA -1748114513 -808590417 RSA -1760181230 -820657134 RSA -1757935365 -818411269 RSA -1786141411 -846617315 RSA -1821550964 -882026868
Attached file Testcase (obsolete) —
Here is the testcase. One thread creates 30 RSA keypairs and another one 30 EC keypairs.
Blocks: FIPS2010
Whiteboard: FIPS
Target Milestone: --- → 3.13
Version: unspecified → 3.12.6
Attachment #443695 - Attachment mime type: application/octet-stream → text/plain
updating subject to be more descriptive.
Summary: SQLITE3 backend completely broken! → SQLITE3 backend has object ID race on keygen.
To be clear, the problem Alex has found is not that the SQLITE DB is creating objects sequentially, but that the SQLITE DB is creating objects with the same object ID. I also see that when this happens an object ID is skipped. bob
To be more precise: the issue is present even in single-thread. Using multiple threads is worse, because the function may also fails with strange errors or returns inconsistent data.
Summary: SQLITE3 backend has object ID race on keygen. → SQLITE3 backend has object ID race on keygen (even in single thread).
If it happens with a single thread, it's not a race.
You called it a race ;-) You can launch only one of the two threads and the behaviour is the same. I don't know if fixing the single-thread object ID reuse will fix also the rest (function occasionally failing in multi-thread and bad parallelism).
OK there is clearly a race here. sdb_getObjectID() is supposed to return a unique objectID that is not yet used. It cycles through a list of object ID's it has, making sure it's not already created. There are logs around getting the object id and creating the object, but those locks are only active if SQLITE_UNSAFE_THREADS is set. I don't believe it's set for modern implementations of SQLITE. Turning that one (you sent set SQLITE_UNSAFE_THREADS in your environment and rebuilt to get that). The problem, however is not this race. In sql databases each object has a unique lower set of bits id. The upper 2 bits are set to indicate if the object is a TOKEN object (always "1" in this case) and if the object should be in the key or cert DB ("0" = key "1"= cert). Public keys are stored in the "cert" db (which is really the public objects db). running alex's test with some changes which output the object ID as a hex value rather than a decimal I see: RSA i=0 0xa925c6ac 0xe925c6ab RSA i=1 0xa925c6ac 0xe925c6ab RSA i=2 0xa925c6ae 0xe925c6ad RSA i=3 0xa925c6ae 0xe925c6ad RSA i=4 0xa925c6b0 0xe925c6af RSA i=5 0xa925c6b0 0xe925c6af RSA i=6 0xa925c6b2 0xe925c6b1 RSA i=7 0xa925c6b2 0xe925c6b1 RSA i=8 0xa925c6b4 0xe925c6b3 RSA i=9 0xa925c6b4 0xe925c6b3 RSA i=10 0xa925c6b6 0xe925c6b5 RSA i=11 0xa925c6b6 0xe925c6b5 Clearly we are getting unique values for the public and private keys. I've turned threads off and put a 30 second sleep between keygens without having an effect on the outcome, so clearly the above race is not the issue. I suspect it's probably our code that is trying to detect import of an identical object into the token and turning a create into an update that is causing us grief here. bob
OK, it is a problem of the code meant to handle common updates. The CKA_ID for these keys are set to NULL. This means every keygen will match any other keygen. The old DB does not have this issue because the oldDB automatically sets CKA_ID (because it has to, it can't store a different value for CKA_ID). NSS does not show this problem because the NSS keygen function sets the CKA_ID values immediately after the key gen. NOTE: NSS would be susceptible to a race here, where 2 different keygen's are running at the same time, the first generates the key, task switches, and the second goes to generate a key. Both keygen will return identical handles to identical keys in this case. patch included.
Status: UNCONFIRMED → ASSIGNED
Ever confirmed: true
Assignee: nobody → rrelyea
I tried the patch and modified the testcase to spawn additional threads (I'm testing it on a 8 cores machine). These are the new result (200 EC P-256 keys and 50 RSA 1024 keys with 4 threads): /usr/bin/time ./a.out /usr/local/lib/nss/libsoftokn3.so configdir='.' 12345678 ... 8.18 real 14.27 user 0.18 sys %/usr/bin/time ./a.out /usr/local/lib/nss/libsoftokn3.so configdir='sql:.' 12345678 ... RSA (1) C_GenerateKeyPair 5 ... RSA (3) C_GenerateKeyPair 5 ... RSA (1) C_GenerateKeyPair 5 ... RSA (3) C_GenerateKeyPair 5 ... 50.25 real 14.03 user 0.64 sys In other words: 1) the IDs are not reused anymore (OK) 2) occasionally the function returns CKR_GENERAL_ERROR (BAD): before this happens, all the threads are blocked. I think (supposition) the sqlite db is locked and no threads can successfully create new keys. After about 10 seconds (the sleep/retry cycle), one keypairgen returns CKR_GENERAL_ERROR and all threads restart to work until the condition happens again So, the user time is quite the same, but the real time is much higher (about 40 seconds for the 4 keypairgen that failed).
Attached file Updated testcase
The updated testcase, that spawns 4 threads and prints also the thread number.
Attachment #443695 - Attachment is obsolete: true
Attached file Backtrace of deadlock
This is the backtrace of the four blocked threads. As I supposed, they are all waiting on db mutex. I hope this will help you finding the bug.
Alex, let's make a separate bug for your deadlock issue. It's clearly different from your previous issue. It looks very much like a deadlock deep inside sqlite itself, so we probably should push the issue upstream to the sqlite maintainer. Some questions, however: Which version of sqlite are you using? Mozilla's, NSS's or one from the OS? If the middle, it might just be a matter of updating to the latest sqlite. bob
Also, was this built with SQLITE_UNSAFE_THREADS set? bob
External, latest release, 3.6.23.1, built thread-safe. (SQLITE_UNSAFE_THREADS doesn't exist anymore)
It does in NSS... If you build NSS with that define, it will do thread locking before calling sqlite. bob
Ops, misunderstood. Rebuilt NSS with that define. The result is worse, much more (and sequentially) CKR_GENERAL_ERROR failures. 286.84 real 22.29 user 1.06 sys
New backtrace with SQLITE_UNSAFE_THREAD enabled.
Robert Relyea asked in comment 13: > Which version of sqlite are you using? Mozilla's, NSS's or one from the OS? In Comment 15, Alex Dupre answered: > External, latest release, 3.6.23.1, built thread-safe. > (SQLITE_UNSAFE_THREADS doesn't exist anymore) I believe that makes this bug invalid. NSS tests, qualifies and supports its releases against certain specific versions of SQLITE3. If it fails with THOSE versions, that's a bug for which the NSS team will take responsiblity. If it fails with some other version, that's a failure of an unsupported combination. You know the old story: "Doctor, it hurts when I do this. Well, then, don't do that!"
That doesn't make the bug invalid, because the behaviour is *identical* with the bundled sqlite.
The race condition seems so: - one thread has a lock on the sqlite db and is trying to acquire the monitor - a second thread has acquired the monitor and is trying to write to the sqlite db This is a stall condition. After the second thread timeouts and fails, the first one continue to do its job.
(I'm an intern at Google working on Chrome.) We ran into these two problems doing some stress-testing of keygen's thread-safety. The CKA_ID patch in comment #9 does fix the first issue for us as well. For the deadlock, it appears the problem is in sftk_signTemplate. Specifically, the block if (handle->type != SFTK_KEYDB_TYPE) { keyHandle = handle->peerDB; } when we switch over to the peerDB, the resulting writes are not done with a transaction open. This breaks what I can see of sdb.c's locking assumptions. (The net result is that we write to the sqlReadDB handle and act quite oddly.) The attached patch opens (and cleans up) a transaction if necessary in sftk_signTemplate. I have not been able to reproduce the stall with this enabled. Since it does mess around with transactions, the code is somewhat frail in places. In particular, I'm worried about future callers to sftk_signTemplate that may already have a transaction on the other database open. Also, should a similar dance be necessary in future, one which takes the transactions in the opposite order could introduce another stall. More importantly, if the certDB transaction later fails, the changes to the signature will not be rolled back. I'm not sure that can be fixed easily, seeing as they live in different databases.
Attachment #452156 - Flags: review?(rrelyea)
Bob: David Benjamin works with me this summer. He ran into this bug independently and spent two days tracking this down. Please review his patch while he remembers the details clearly. Thank you!
Comment on attachment 452156 [details] [diff] [review] When signing attributes to a peer db, remember to create a transaction r=wtc. Please change "Peer" to "PeerDB" in "usingPeer" and "inPeerTransaction".
Attachment #452156 - Flags: review+
I renamed the variables as I suggested in comment 24, and checked davidben's patch in on the NSS trunk (Softoken 3.12.7) and the SOFTOKEN_3_13_BRANCH (Softoken 3.13). Checking in sftkdb.c; /cvsroot/mozilla/security/nss/lib/softoken/sftkdb.c,v <-- sftkdb.c new revision: 1.25; previous revision: 1.24 done Checking in sftkdb.c; /cvsroot/mozilla/security/nss/lib/softoken/sftkdb.c,v <-- sftkdb.c new revision: 1.24.4.2; previous revision: 1.24.4.1 done
Attachment #452156 - Attachment is obsolete: true
Attachment #453578 - Flags: review?(rrelyea)
Attachment #452156 - Flags: review?(rrelyea)
r=wtc. I edited the comment in sftkdb_lookupObject to note that it is the caller rather than that function that will create the object. I think CKR_OBJECT_HANDLE_INVALID is not an appropriate return code for sftkdb_getFindTemplate because that function doesn't deal with object handles. CKR_ATTRIBUTE_VALUE_INVALID seems better, because the CKA_ID attribute in ptemplate is empty. Bob, what do you think? Since this return code is only used by sftkdb_getFindTemplate to report this special condition to its caller, sftkdb_lookupObject, as long as this return code is not used under any other condition, it doesn't matter what the return code is. Right now, sftkdb_getFindTemplate returns CKR_OK or CKR_TEMPLATE_INCOMPLETE under other conditions. I checked in this patch on the NSS trunk (Softoken 3.12.7) and SOFTOKEN_3_13_BRANCH (Softoken 3.13). Checking in sftkdb.c; /cvsroot/mozilla/security/nss/lib/softoken/sftkdb.c,v <-- sftkdb.c new revision: 1.26; previous revision: 1.25 done Checking in sftkdb.c; /cvsroot/mozilla/security/nss/lib/softoken/sftkdb.c,v <-- sftkdb.c new revision: 1.24.4.3; previous revision: 1.24.4.2 done
Attachment #443752 - Attachment is obsolete: true
Alex, it would be nice if you could verify the two patches in this bug fixed the bug for you. Thanks.
Status: ASSIGNED → RESOLVED
Closed: 15 years ago
Resolution: --- → FIXED
Target Milestone: 3.13 → 3.12.7
I'll verify it soon, in the next days.
The testcase now passes successfully. The bug is fixed. I currently cannot say if the patch introduces other deadlocks elsewhere or not.
Alex, thanks a lot for verifying the fix. David Benjamin has been reviewing the NSS softoken's locking of sqlite and sqlite's internal locking in the past couple of days. I encouraged him to write up and publish his findings as an NSS technical note.
Status: RESOLVED → VERIFIED
Comment on attachment 453578 [details] [diff] [review] When signing attributes to a peer db, remember to create a transaction, by David Benjamin (checked in) r+ rrelyea Good catch!
Attachment #453578 - Flags: review?(rrelyea) → review+
You need to log in before you can comment on or make changes to this bug.

Attachment

General

Creator:
Created:
Updated:
Size: