Closed Bug 149854 Opened 23 years ago Closed 23 years ago

CRL cache to avoid DER CRL decodes on every cert verification

Categories

(NSS :: Libraries, defect, P1)

Tracking

(Not tracked)

RESOLVED FIXED

People

(Reporter: julien.pierre, Assigned: julien.pierre)

References

Details

Attachments

(1 file)

When an application verifies a certificate, for example using CERT_VerifyCert explicitly, or on a server during an SSL client auth handshake, NSS searches all PKCS#11 tokens for a CRL matching the certificate's issuer. If if finds one, the token (softoken) returns the matching DER CRL, which is then decoded, searched, and immediately discarded. Every time there is a new certification verification in the application, the process repeats, and the DER CRL is decoded all over again, even if it hasn't been modified. This is an extremely wasteful use of CPU and memory. The repeated decode operations are not necessary. To give an example of just how bad the impact of this bug is in the real-world : 1) install a CRL with 1200000 entries into cert7.db 2) start NES web server with that DB and set it up for client auth 3) login to the site with your client cert 4) the server's CPU peaks for 3 minutes decoding it 5) go back to 3. As you see, this limits the server - an ultra10 at 440 MHz - to a mere 20 client auth handshakes per *hour*, at peak CPU usage. If a CRL is not used, the same server will be able to do about 30000 handshakes per hour, 3 orders of magnitude more. While the first 3 minute decoding step is inevitable, the 19 subsequent ones can be avoided. This would mean that the server with revocation would perform about the same with CRLs as without, once it had got past the initial long CRL decoding. The solution to avoid repeated CRL decoding is to implement a cache of CRL objects. The cache should keep track of both decoded and DER versions of CRL objects by issuer. This cache should be above the pk11wrap layer and below the cert verification layer. I will post more details about the design of this cache as they become available.
Taking bug.
Severity: normal → major
Status: NEW → ASSIGNED
Priority: -- → P1
Target Milestone: --- → 3.6
Really taking bug
Assignee: wtc → jpierre
Status: ASSIGNED → NEW
Depends on: 162753
Depends on: 162976
I have created an implementation of the cache based on the design we discussed. It isn't complete yet, but it is functional enough to allow it to be tested. The two missing parts are : - code to check that the PKCS#11 object has been deleted in the token. This means currently the CRL is cached forever - a hash table of serial numbers to entries. Currently the cache is doing linear searches Hopefully I can complete these parts tomorrow. For now, I have been able to successfully test this patch with NES 6.1. It shows that client cert validation is much faster, but there are many other parts in the code path besides cert validation that are more expensive than the cache lookup, so this turned out not to be a good test. The benchmark I decided to use is a modified verification of certutil that does multiple iterations and collects the time spent on doing verification the first time and for subsequent operations. Here are the results on a Solaris debug build. First, when trying to verify a good certificate that is not on the CRL. This is the worst case, but also the most common : the entire array is traversed linearly. (strange)/export/home/jpierre/nss/36/mozilla/dist/SunOS5.8_DBG.OBJ/bin{85} ./good Time for initial uncached verification : 0:58.177100 Time for 99 cached verifications : 0:52.425020 Average per cached verification : 0: 0.529540 certutil: certificate is valid 73.70u 1.70s 1:50.99 67.9% As you can see, the initial uncached verification takes 58 seconds. This time is spent looking up the 26 MB CRL, fully decoding it, then traversing the array of 1.2 million entries and not find a match. [Bob, this lookup + decode was only taking 23 seconds two days ago on the debug build. The decode part is 11s. This means the softoken lookup part went up from 12s to 47s over the last couple days. I don't know why yet. This might be a problem in my environment. But this decrease in performance in the uncached case doesn't concern the cache which is the subject of this bug] Then, the subsequent cached lookups take about a half second each. This time is mostly spent traversing the entire array of entries. With a revoked certificate, the situation is much better : (strange)/export/home/jpierre/nss/36/mozilla/dist/SunOS5.8_DBG.OBJ/bin{84} ./revoked Time for initial uncached verification : 0:55.290150 Time for 99 cached verifications : 0: 0.143320 Average per cached verification : 0: 0. 1440 certutil: certificate is invalid: Peer's Certificate has been revoked. This is most likely because the serial number is listed early in the CRL array and therefore the linear search matches quickly and stops. In summary, the cache improves performance by a factor of 40000 on the revoked certificate, and about 100 on the good certificate. Keep in mind this was in the debug build - I expect the improvement to be less in the optimized build. But I still have some bugs in the cache and the optimized build dumps core right now, so I couldn't measure performance with it. By implementing the hash table of serial numbers to entries, both cases of revoked and valid certificates would end up taking the same short amount of time to look up in the CRL. Therefore, the improvement would be the maximum in both cases. It is also important to note that in the cache, the CRL is protected by a lock, unlike in the past where each thread had its own full separate copy of it. This means that if we keep it as an array, and we end up having long linear searches (most users on a server are not revoked, so this is the common case), then this means all certificate verifications on the same issuer in other threads are effectively serialized. With a lookup time of 0.5 s to traverse the full CRL, this can have a very bad effect in a multiprocessor server. Of course, this is probably a moot point compared to the current horrid performance of CRLs without the cache ... But since we are trying to solve the problem, I think we should solve it completely, and not just half of it.
Sounds like the CRL cache entry wants to have a reader/writer lock rather than an ordinary lock. There is an implementation of reader/writer locks in lib/util/nssrwlk*.[ch] that may be suitable.
Yes, the reader/writer lock would be useful. I will look at using it. FYI, about the 58s time last night for the original lookup - it was probably something with the state of the disk cache on my Solaris box, because today I am running the same exact test script with identical binaries and the inital uncached verification time is back down to 22s.
It turns out the crashes I was seeing last week were unrelated to my cache code. It was working fine. The R/W lock implementation that Nelson suggested indeed turned out to be quite suitable. The test below validates a good certificate, which means it has to traverse the whole array, since I don't have the hash table implemented yet. It was performed on a 2-way machine. The 200 validations are done in two threads, 100 in each thread. In the first test below, I use PRLock, which serializes access to the CRL. (nes02)/h/strange/export/home/jpierre/nss/36/mozilla/dist/SunOS5.8_OPT.OBJ/bin{44} ./good Time for initial uncached verification : 0:11.898470 Time for 200 cached verifications : 1: 7. 87600 Average time per cached verification : 0: 0.335430 certutil: certificate is valid 73.51u 1.20s 1:19.63 93.8% In the second test, I use NSSRWLock in read mode when reading, write mode when updating. There were no writes during this test. (nes02)/h/strange/export/home/jpierre/nss/36/mozilla/dist/SunOS5.8_OPT.OBJ/bin{47} ./good Time for initial uncached verification : 0: 7.343350 Time for 200 cached verifications : 0:36. 10650 Average time per cached verification : 0: 0.180050 certutil: certificate is valid 76.77u 1.15s 0:43.85 177.6% In the second test, both CPUs were peaked 100%. And as you can see, the time went down from 69s to 36s, not quite half, but close.
Scratch that. I wasn't running the lock properly. I wasn't running with any lock at all. Maybe it's time to head home.
OK, this time I really got the rwlock code in there. And actually it performs about the same as no locking, with no updates going in the background. (nes02)/h/strange/export/home/jpierre/nss/36/mozilla/dist/SunOS5.8_OPT.OBJ/bin{59} ./good Time for initial uncached verification : 0: 7.291630 Time for 200 cached verifications : 0:37.465910 Average time per cached verification : 0: 0.187320 certutil: certificate is valid 79.71u 1.00s 0:45.27 178.2%
The NSSRWLock is fine if the check is long-lived, ie. as it currently is with the linear search of the array of entries. In the tests I ran last night, one verification was taking about 330,000 CPU-microseconds . Therefore, there wasn't much lock contention. However, when the hash-table is implemented (very soon), the lookup will be much shorter. I don't have the hash table implemented yet, but I have a test that comes very close to it - the verification of a cert listed on the CRL, which always completes in under 1000 microseconds, as it is found on the CRL. So I ran this revoked cert test with the NSSRwLock, and I found that things weren't nearly as good. Here are the results : - 100 threads doing 100 iterations each, 2 CPUs enabled : Cached lookups took 7.9s to complete - 100 threads doing 100 iterations each, 1 CPU enabled : Cached lookups took 5.5s to complete - 2 threads doing 10000 iterations each, 2 CPUs enabled : Cached lookups took 9.9s to complete - 2 threads doing 10000 iterations each, 1 CPU enabled : Cached lookups took 9.6s to complete This shows that with 100 threads, the NSSRWLock has negative scalability of about 33% when a second CPU is added on Solaris. With two threads it is roughly on par between 1 and 2 CPUs, but there is no benefit from adding a second CPU.
I have completed a functional implementation of the CRL cache. The items that were missing from last week that are now implemented are : 1) the cache always checks the token first to see if the CRL object still exists 2) there is an optional reader/writer lock to protected the decoded CRL - or strictly speaking, the "CRLDPCache" which is a distribution point cache object. I still have second thoughts about this, so this is #ifdef'ed by USE_RWLOCK . When this macro is not defined, a PRLock is used instead. 3) serial numbers are matched through a hash table instead of a linear search 4) there is a new ShutdownCrlCache() cleanup function that frees the CRL cache. It notably deletes all issuer certs and CRL objects used by the cache at NSS_Shutdown() time I have verified that performance is equal for both revoked and good certs. Here are the results on lookups in a 26 MB, optimized build, on a single AMD CPU on OS/2, using the reader/writer lock implementation. [e:\dev\nss\36\mozilla\dist\os22.45_icc_opt.obj\bin]good Time for initial uncached verification : 0: 7. 61028 Time for 10000 cached verifications : 0: 6.238655 Threads : 100. Iterations : 100. Average time per cached verification : 0: 0. 617 E:\DEV\NSS\36\MOZILLA\DIST\OS22.45_ICC_OPT.OBJ\BIN\CERTUTIL.EXE: certificate is valid [e:\dev\nss\36\mozilla\dist\os22.45_icc_opt.obj\bin]revoked.cmd Time for initial uncached verification : 0: 7. 48772 Time for 10000 cached verifications : 0: 6.674798 Threads : 100. Iterations : 100. Average time per cached verification : 0: 0. 657 E:\DEV\NSS\36\MOZILLA\DIST\OS22.45_ICC_OPT.OBJ\BIN\CERTUTIL.EXE: certificate is invalid: Peer's Certificate has been revoked. [e:\dev\nss\36\mozilla\dist\os22.45_icc_opt.obj\bin] Keep in mind the time for the uncached looked is optimistic - it is only so short because of the 15,000 rpm SCSI disk. On my Ultrasparc 10 with IDE disk, it typically takes over 20s. I also checked that it worked fine on Solaris.
This patch contains all necessary changes for the CRL cache. Most of the new code is at the end of certdb/crl.c
I have attached a patch which contains the implementation of the CRL cache. It is fully function and I have fixed all the bugs that I know of. But the source hasn't been reviewed by anyone yet, or double-checked. I am not very happy with the code (function naming, organization). I was really struggling when writing it - it should have been written in C++ rather than C. There is still a scalability problem on SMP that remains a mystery. Here is an example of a result on a dual CPU box, when verifying my Thawte certificate. With two CPUs enabled : Time for initial uncached verification : 0: 0.129470 Time for 10000 cached verifications : 0:41.128490 Threads : 100. Iterations : 100. Average time per cached verification : 0: 0. 4110 With one CPU enabled : Time for initial uncached verification : 0: 0. 99230 Time for 10000 cached verifications : 0:26.120070 Threads : 100. Iterations : 100. Average time per cached verification : 0: 0. 2610 As you see, with a single CPU, we complete the 10000 verifications in 2/3rd of the time as with two CPUs. I have monitored the CPU usage with vmstat. Both CPUs are peaked at all times - there isn't any idling. In single CPU mode, the lone CPU is also peaked. FYI, most of the time in this test is not spent in the cache, but rather in the certificate signature verification. My Thawte cert has a 2048-bit key. The previous tests I conducted were with certs that had a key of only 512 bits. This is why they could verify in only about 600 microseconds - vs 2600 microseconds for this Thawte cert with a 2048 bits key. I will write a new test case that only verifies the CRL rather than perform the whole certificate verification, to eliminate this variable from the tests, as well as others, such as the number of certs in the cert chain that are getting verified.
Depends on: 165639
FYI, I also measured a very noticeable improvement with PSM. For my test, I used Mozilla 1.1 with my current cert/key database - which has about 10 certs, including AOL certs and Thawte certs, and AOL and Thawte CRLs. I measured the time to open PSM, on the 400 MHz OS/2 machine in my office. - with NSS 3.6 tip : 15 seconds - with NSS 3.6 + CRL cache : 5 seconds This is in large part because PSM is doing 40 verifications for each cert, and therefore pulls the CRL 40 times. So the CRL cache is a huge win for PSM too, not just for servers.
Code checked in. NSS 3.6 is now feature complete
Status: NEW → RESOLVED
Closed: 23 years ago
Resolution: --- → FIXED
You need to log in before you can comment on or make changes to this bug.

Attachment

General

Creator:
Created:
Updated:
Size: