Closed Bug 98926 Opened 24 years ago Closed 23 years ago

RFE: PKCS#11 session debug tool

Categories

(NSS :: Libraries, enhancement, P1)

enhancement

Tracking

(Not tracked)

RESOLVED FIXED

People

(Reporter: lord, Assigned: wtc)

Details

Attachments

(4 files, 2 obsolete files)

While testing several smart cards, I discovered that vendors implement some functions differently (probably incorrectly at that). For example, one vendor's driver seems to mangle the list of certs on the token, which causes confusing UI for the user in PSM's Cert Manager. Another vendor's driver causes SSL pages at sites like Amazon to be painfully slow. It would be great to have some way to get a complete log of all PKCS#11 traffic between the software and the driver. Such a tool would allow us to tell if one vendor is performing operations differently than another vendor.
Status: NEW → ASSIGNED
Target Milestone: --- → Future
This sounds like a very good idea. It would require hooks within the NSS PKCS#11 interface code . The debugging code could also include some timing information to help profiling third party or the internal PKCS module if needed. It would probably have a pstack-like output for each call to the PKCS#11 library as well as timestamps before and after, written to a log file, perhaps one log file per PKCS module. This might be a good first project for me if this RFE gets prioritized.
Please make sure any new code for this purpose doesn't slow down the PKCS#11 code even more. I'd suggest making it all #ifdef DEBUG.
Steve Parkinson suggested that we create a PKSC#11 shim module which looked like the token to NSS, and looked like NSS to the token. That way you could use it on live, real, non-debug systems (like on a user's desktop). That shim would spit out the raw data and timestamps.
Assigned the bug to Julien.
Assignee: wtc → jpierre
Status: ASSIGNED → NEW
OS: Windows 2000 → All
Hardware: PC → All
Severity: normal → enhancement
Priority: -- → P3
Changed the QA contact to Bishakha.
QA Contact: sonja.mirtitsch → bishakhabanerjee
I had the desire to have this recently, and found a simple way to code it up. After the call to get the function list from cryptoki, I insert a new function list table that intercepts all calls to the module and does logging. I defined all the logging functions by running a simple perl script over ck.api. I agree with what Nelson said in comment #2, I had already #ifdef DEBUG'ed the code. This can be extended to write more specific info for each call, I just dump the value of the arguments and the return value. As it is currently written, it can only work with one module at a time. To activate it, you set NSPR_LOG_MODULES=all:5, NSPR_LOG_FILE=<file>, and NSS_DEBUG_MODULE=<module name>. Here is sample output from certutil -L: 1024[80661c0]: Loaded library a.out (init) 1024[80661c0]: C_Initialize 1024[80661c0]: pInitArgs = 4010ab78 1024[80661c0]: rv = 00000000 1024[80661c0]: C_GetInfo 1024[80661c0]: pInfo = bffff470 1024[80661c0]: rv = 00000000 1024[80661c0]: C_GetSlotList 1024[80661c0]: tokenPresent = 00000000 1024[80661c0]: pSlotList = 00000000 1024[80661c0]: *pulCount = 0 1024[80661c0]: rv = 00000000 1024[80661c0]: C_GetSlotList 1024[80661c0]: tokenPresent = 00000000 1024[80661c0]: pSlotList = 08088850 1024[80661c0]: *pulCount = 2 1024[80661c0]: rv = 00000000 1024[80661c0]: C_GetSlotInfo 1024[80661c0]: slotID = 00000001 1024[80661c0]: pInfo = bffff3c0 1024[80661c0]: rv = 00000000 1024[80661c0]: C_GetTokenInfo 1024[80661c0]: slotID = 00000001 1024[80661c0]: pInfo = bffff300 1024[80661c0]: rv = 00000000 1024[80661c0]: C_GetMechanismList 1024[80661c0]: slotID = 00000001 1024[80661c0]: pMechanismList = 00000000 1024[80661c0]: *pulCount = 27 1024[80661c0]: rv = 00000000 1024[80661c0]: C_GetMechanismList 1024[80661c0]: slotID = 00000001 1024[80661c0]: pMechanismList = 08089f88 1024[80661c0]: *pulCount = 86 1024[80661c0]: rv = 00000000 1024[80661c0]: C_OpenSession 1024[80661c0]: slotID = 00000001 1024[80661c0]: flags = 00000004 1024[80661c0]: pApplication = 08089e00 1024[80661c0]: Notify = 400ae248 1024[80661c0]: phSession = bffff2f0 1024[80661c0]: rv = 00000000 1024[80661c0]: C_FindObjectsInit 1024[80661c0]: hSession = 00000001 1024[80661c0]: pTemplate = bffff390 1024[80661c0]: ulCount = 1 1024[80661c0]: rv = 00000000 1024[80661c0]: C_FindObjects 1024[80661c0]: hSession = 00000001 1024[80661c0]: phObject = bffff360 1024[80661c0]: ulMaxObjectCount = 1 1024[80661c0]: *pulObjectCount = 808464432 1024[80661c0]: rv = 00000000 1024[80661c0]: C_FindObjectsFinal 1024[80661c0]: hSession = 00000001 1024[80661c0]: rv = 00000000
Attached patch patch to insert logging code (obsolete) — Splinter Review
Comment on attachment 83377 [details] [diff] [review] code to implement PKCS#11 logging Ian, You are using the 0x%08x format to print pointers. The portable way to print pointers is to use the %p format. Does the 0x%08x format break 64-bit pointers?
It should not break them, but it does fail to have the desired effect of printing leading zeros. Since that is the case, I'll remove the minimum field width specifiers and use %p for all argument types ending in _PTR. Note: I just generated this from a perl script. We can take the time to make the logging more specific for each function (printing CK_BYTE_PTR's as a string, for example).
Comment on attachment 83376 [details] [diff] [review] patch to insert logging code Ian, I suggest one minor change to this patch. Instead of #ifdef DEBUG, I suggest that you use #ifdef DEBUG_MODULE and add the following at the beginning of this file: #ifdef DEBUG #define DEBUG_MODULE 1 #endif This will make it easier to do an optimized build of NSS with your debug module code. Of course, if the overhead of your debug module code when it is disabled is negligible, we might as well just include it in both debug and optimized builds.
The peformance impact of this patch is negligible. If the environment variable NSS_DEBUG_MODULE is not set, the worst that happens is a strcmp for every module we load. In that sense, it is actually safe to include this code in all builds, including optimized. However, the logging code does take space, so it will inflate the executable size. Perhaps that is worth it? It may be to our advantage to include this in all builds -- browser users could simply turn on environment variables, and then send us traces of smart card failures.
Is PR_LOG thread-safe ? Also, I don't see any thread information being logged in this patch. I think that will be needed for most applications. Lastly, for servers that do a lot of operations (eg. during a benchmark) I think it would be desirable to have a buffered logging scheme and flush it to disk only when the buffer gets full, similar to the access log in web server.
I think this is excellent! I think the code will go a long way to improve the debuggability and usability of PKCS#11 modules. I would suggest that this code be only enabled in debug builds of NSS. The reason being that I think that more information would be added in the future (decoding of mechanisms, dumping objects), and I would not want the size to be a factor in allowing these changes. I would suggest adding a timer so we have some indications of how long the operations took. (indicate time of op start, end) Julien: I think it's still valuable to have immediate logs. Sometimes an operation hangs, and you want to find out why as it happens.
> Is PR_LOG thread-safe ? AFAIK. > Also, I don't see any thread information being logged in this patch. I think > that will be needed for most applications. IINM, the first column of numbers is milliseconds since execution began, and the second [bracketed] column is a thread id of some kind. > Lastly, for servers that do a lot of operations (eg. during a benchmark) I think > it would be desirable to have a buffered logging scheme and flush it to disk > only when the buffer gets full, similar to the access log in web server. This can be done with the PR_LOG API (PR_SetLogBuffering, PR_LogFlush).
I was mistaken. It looks like the first column of numbers is also some kind of thread ID (perhaps OS, while the second is NSPR's?). Unfortunately, the documentation does not describe the output format. It would be easy enough to add a call to PR_Now when the function is entered, and another when it is finished. That, combined with Wan-Teh's suggestion in comment #11, means we could do special optimized builds with logging included to time cryptoki calls. Of course, the logging would cause artificial slowdown, but it would still be useful.
Ian wrote: > It looks like the first column of numbers is also some kind of > thread ID (perhaps OS, while the second is NSPR's?). The first column of numbers is some kind of thread ID. On Unix it is the pthread_t value (that is, the OS thread ID). On other platforms (Windows, Mac, OS/2, etc.) it is an integer thread ID, starting from 1, that is maintained by NSPR. The second column of numbers (or rather pointers) is the NSPR thread ID (the PRThread * pointer value). > Unfortunately, the documentation does not describe the output > format. I just opened bug 144449 about this problem.
Attachment #83376 - Attachment is obsolete: true
sample output: 1024[805de30]: C_GetAttributeValue 1024[805de30]: hSession = 0x1000001 1024[805de30]: hObject = 0xf55e721f 1024[805de30]: pTemplate = 0xbffff410 1024[805de30]: ulCount = 7 1024[805de30]: rv = 0x0, time(ms) = 1 1024[805de30]: C_GetAttributeValue 1024[805de30]: hSession = 0x1000001 1024[805de30]: hObject = 0xf55e721f 1024[805de30]: pTemplate = 0xbffff410 1024[805de30]: ulCount = 7 1024[805de30]: rv = 0x0, time(ms) = 2 1024[805de30]: C_OpenSession 1024[805de30]: slotID = 0x1 1024[805de30]: flags = 0x4 1024[805de30]: pApplication = 0x8081970 1024[805de30]: Notify = 0x400aeb20 1024[805de30]: phSession = 0xbffff300 1024[805de30]: rv = 0x0, time(ms) = 0 1024[805de30]: C_DigestInit 1024[805de30]: hSession = 0x2 1024[805de30]: pMechanism = 0xbffff340 1024[805de30]: rv = 0x0, time(ms) = 0 1024[805de30]: C_DigestUpdate 1024[805de30]: hSession = 0x2 1024[805de30]: pPart = 0x8088620 1024[805de30]: ulPartLen = 162 1024[805de30]: rv = 0x0, time(ms) = 0 1024[805de30]: C_DigestFinal 1024[805de30]: hSession = 0x2 1024[805de30]: pDigest = 0x8087ba8 1024[805de30]: rv = 0x0, time(ms) = 0 1024[805de30]: C_CloseSession 1024[805de30]: hSession = 0x2 1024[805de30]: rv = 0x0, time(ms) = 0 Note that the intervals are usually very short, but at least this provides a mechanism to find operations that do take excessively long.
Attachment #83377 - Attachment is obsolete: true
Okay, I have decided there are two different goals here, 1) logging activity of a PKCS#11 module, and 2) performance measurements of the softoken. Both can be addressed by the same mechanism. Here is what I propose: 1) Logging is only available in debug builds. Uses PR_LOG to dump function arguments and return values. To activate, set NSS_DEBUG_PKCS11_MODULE to name of target module. 2) Softoken metrics. Only available in optimized builds. Measure the number of calls to each PKCS#11 function and the total time spent in each function. Use the results to create prof-like output. To activate, set NSS_DEBUG_PKCS11_MODULE to name of target module, and use NSS_DEBUG_MODULE=1 with gmake.
This data was generated with an optimized selfserv and NSS_DEBUG_PKCS11_MODULE="NSS Internal PKCS #11 Module". The selfserv required client auth on both handshakes, and ran with 20 threads. It was hit 1000 times by a strsclnt with 10 threads using "SSL3 RSA WITH RC4 128 MD5".
Assigned the bug to Ian. Target NSS 3.6, priority P1.
Assignee: jpierre → ian.mcgreer
Component: Tools → Libraries
Priority: P3 → P1
Target Milestone: Future → 3.6
Here is a summary of how the logging code is implemented: #include "prlog.h" static PRLogModuleInfo *modlog = NULL; static CK_FUNCTION_LIST_PTR module_functions; static CK_FUNCTION_LIST debug_functions; static void print_final_statistics(void); ... static PRInt32 counter_C_SignInit = 0; static PRUint32 calls_C_SignInit = 0; CK_RV NSSDBGC_SignInit( CK_SESSION_HANDLE hSession, CK_MECHANISM_PTR pMechanism, CK_OBJECT_HANDLE hKey ) { CK_RV rv; PRIntervalTime start, ival; PR_AtomicIncrement(&calls_C_SignInit); PR_LOG(modlog, 1, ("C_SignInit")); PR_LOG(modlog, 4, (" hSession = 0x%x", hSession)); PR_LOG(modlog, 4, (" pMechanism = 0x%p", pMechanism)); PR_LOG(modlog, 4, (" hKey = 0x%x", hKey)); start = PR_IntervalNow(); rv = module_functions->C_SignInit(hSession, pMechanism, hKey); ival = (PRInt32)(PR_IntervalNow() - start); PR_AtomicAdd(&counter_C_SignInit, ival); PR_LOG(modlog, 1, (" rv = 0x%x\n", rv)); return rv; } ... CK_FUNCTION_LIST_PTR nss_InsertDeviceLog( CK_FUNCTION_LIST_PTR devEPV ) { module_functions = devEPV; modlog = PR_NewLogModule("nss_mod_log"); ... debug_functions.C_SignInit = NSSDBGC_SignInit; ... return &debug_functions; } static void print_final_statistics(void) { int total_calls = 0; PRInt32 total_time = 0; printf("%-25s %10s %11s %10s %10s\n", "Function", "# Calls", "Time (ms)", "Avg. (ms)", "% Time"); printf("\n"); ... total_calls += calls_C_SignInit; total_time += counter_C_SignInit; ... printf("%-25s %10d %10d ", "C_SignInit", calls_C_SignInit, counter_C_SignInit); if (calls_C_SignInit > 0) { printf("%10.2f", (float)counter_C_SignInit / (float)calls_C_SignInit); } else { printf("%10.2f", 0.0); } printf("%10.2f", (float)counter_C_SignInit / (float)PR_IntervalToMilliseconds(total_time) * 100); printf("\n"); ... printf("%25s %10d %10d\n", "Totals", total_calls, total_time); } The code is the same for all functions, this shows the full implementation for C_SignInit. Here are the rules: 1. As Bob requested, performance data is collected in all builds (if logging is enabled). This data is dumped to stdout at shutdown. 2. I changed the function logs to use different levels for the function and its parameters. This way you can set the logging level (via NSPR_LOG_MODULES) to "1" and get output like: 1024[805ef10]: C_Initialize 1024[805ef10]: rv = 0x0 1024[805ef10]: C_GetInfo 1024[805ef10]: rv = 0x0 1024[805ef10]: C_GetSlotList 1024[805ef10]: rv = 0x0 or to "4" for more verbose output: 1024[805ef10]: C_Initialize 1024[805ef10]: pInitArgs = 0x4010c938 1024[805ef10]: rv = 0x0 1024[805ef10]: C_GetInfo 1024[805ef10]: pInfo = 0xbffff340 1024[805ef10]: rv = 0x0 1024[805ef10]: C_GetSlotList 1024[805ef10]: tokenPresent = 0x0 1024[805ef10]: pSlotList = 0x0 1024[805ef10]: pulCount = 0xbffff33c 1024[805ef10]: *pulCount = 0x2 1024[805ef10]: rv = 0x0
This document completely describes the usage of the logger.
As far as the performance mode of the logger, I'd like to add the following comments, which I made in our meeting yesterday. Logging, whether to disk or in-memory, will add significant lock contention, and in heavily multithreaded applications, it will change the performance of the application significantly, and therefore make the collected performance data much less useful. I suggest that we don't use a global lock to serialize access to the log, but rather per-thread locks and condition variables. There needs to be a logging thread. Each application thread would have a log buffer, and no locking or notifying would occur in that thread until the buffer was full, at which point it would notify a condition variable and wait for the flushing of the buffer to the real log by the logging thread.
I think that is well beyond the scope of NSS. Even for debugging purposes, NSS has never maintained its own thread, or asked applications to provide it with one. I guess I'm still not sure what is to gained from all that complexity. If you look at the code in comment #23, you'll see that all actual logging (that is, dumping text output to a buffer) is done with PR_LOG. This code does not even appear in optimized builds, unless you explicitly build with -DFORCE_PR_LOG. In optimized builds, the only additional cost is two calls to PR_Time, an atomic increment, and and atomic add. There are no new locks. If one were to build with -DFORCE_PR_LOG, and then run an optimized build with a server, it seems to me you would end up so inundated with information as to prevent the log from being useful. Perhaps it would have helped with the bug we had were the session counter ran over, but I think there are probably easier ways to track down problems like that than to dump out millions of lines of text. The stated purpose of this RFE was to be able to debug idiosyncracies in PKCS#11 tokens. The kind of problems I would expect it to fix would be like the Rainbow bug we fixed in 3.4.1. Seeing what was wrong there only takes one debug session. If what you propose is required, then I think the best way to do it is what Steve proposed (through Bob) in comment #3: a shim layer that runs as its own library, separate from NSS.
Regarding the comment that this is beyond the scope of NSS, I'm not sure I agree. Consider the zone allocator. It keeps separate locks for separate (groups of) threads, yet it does not require the creation of any new threads. I think the same techniques used in the zone allocator to reduce lock contention could be used for logging purposes.
Ian, I think very large PKCS#11 traces in servers would still be very useful in performance tests. One could write a script to parse the output and display statistics on which are the most frequent calls, how long they last, etc. I believe PR_LOG itself contains locks and perhaps that is the place where we need to make the change. That or the application could provide its own logging callback mechanism which would do the right thing. The big downside of that is that it would require modifying the application to enable logging, rather than just setting environment variables. For that reason, I believe the optimized logging I proposed should be in either NSPR or NSS builds. For optimized builds, it should only be compiled with a specific macro, as the new logging code will likely significantly increased code size.
> I think very large PKCS#11 traces in servers would still be very useful in > performance tests. One could write a script to parse the output and display > statistics on which are the most frequent calls, how long they last, etc. This is my point exactly. The code I have already described in this bug achieves this functionality *without* writing any output to a buffer, until shutdown that is. You're talking about collecting statistics: why do that by dumping text? That seems completely inefficient to me. Again: I think text output should be used for debugging corner cases of functionality with PKCS#11 tokens. Actual statistics should be collected by run-time variables (not text!) and collected/displayed at shutdown.
Ian, When you collect those non-text stats, is there any locking involved or not ? My point was independant of the form that the statistics take - whether it's logging text directly to a file, text to a buffer, or integers in a memory structure. In all cases it seems that you require some sort of locking if you have a global stats holder in your application, unless you use per-thread statistics buckets as I suggested.
PR_AtomicIncrement, PR_AtomicAdd. So locking on architectures that don't support atomic operations.
With PR_AtomicIncrement and PR_AtomicAdd, it seems to me that you can only maintain counters. I'd like to see your code that uses these functions but it doesn't seem to be in your attached patches. With atomics, it seems to me that you can maintain an array of total time counters per C_ function per module, but you cannot maintain a full, growing history of all PKCS#11 functions ever called for each module, with timestamps of start and end points, and thread data. The later can be accomplished with the per-thread buffers without locking, and such a detailed log is much more useful than counters when doing performance work, because it can show contention points over time, whereas global counters cannot as they are missing time and thread data.
The code is summarized in comment #23. I didn't attach it because it is lengthy, and just repeats the same thing for each function. > With atomics, it seems to me that you can maintain an array of total time > counters per C_ function per module, but you cannot maintain a full, growing > history of all PKCS#11 functions ever called for each module, with timestamps of > start and end points, and thread data. True enough. We should either scrap everything on this bug (it can't possibly handle that), or open a new one. I still do not understand why text logging is the way to go for performance analysis. Aren't there tools for doing that? I just can't picture sitting down to parse that kind of a log :-)
Ian, My point was that the PKCS#11 logger could be made to create a full log as I suggest, by using per-thread data for logging. The per-thread data doesn't have to be a text buffer. I'm not particularly fond of text. I'm quite fine with collecting data in binary structures during execution. A linked-list of log entries for each thread would have better performance (less impact on the code being instrumented) since we wouldn't have to do things like an sprintf to append to the text buffer during the run. All that would have to be done for each PKCS#11 call would be to look up the per-thread key for the linked list entry, which would point to the last one, and then append, and set the pointer in the thread data to the new entry. At some point though, the log needs to be exported out of the application after the test run. There has to be a decision on the output data format. I think binary is not a good choice for export as it usually requires complex custom C code to analyze. Text is easier and can be parsed with C. Ideally, a spreadsheet or database output format would be preferable, but I really don't think that really belongs within NSS. The data conversion is nothing that cannot be done externally. So I suggest that we just export the data as text and write external perl scripts as we need to convert the data. Proper and full data collection on the other hand can only be implemented within the NSS code, which is why I suggest that we the logger includes this capability.
> My point was that the PKCS#11 logger could be made to create a full log as I > suggest, by using per-thread data for logging. The code I have written cannot work this way as-is, because it can only work with a single module at a time. Something more complex would have to be introduced for it to work with all modules, not to mention to give per-thread data. > Proper and full data collection on the other hand can only be implemented within > the NSS code, which is why I suggest that we the logger includes this capability. OK, but we have to make a decision: 1) close this bug, because we have something capable of providing PKCS#11 session debugging, per the original RFE, and open a new bug; or 2) start over on this bug, because proper and full data collection is now a requirement.
In comment #34, I meant that text was simple and could be parsed with perl. As far as comment #35 and the requirements, timestamping for profiling purposes was discussed from the beginning in comment #1. Thread data was not explicitly listed, but was implicitly required for profiling in the context of all multithreaded servers that use NSS. I am not saying that your patch isn't useful, it certainly is - it's just that it wouldn't meet the requirements for the sort of performance work we have done in the past year jointly with Sun and other crypto accelerator vendors, which we may have to do again. For this reason I think it would be worthwile to re-implement the logger with the full set of features that was discussed - the ability to trace more than one module, timestamping, profiling, and thread data. In fact, it would be very useful to have your full patch in order to write the new implementation. You are already 1/3 of the way there with your code - what needs to be added is : 1) separate logging by module . This involves creating a per-module structure. That structure would for example hold all the counters for atomic ops thatyou already defined 2) define a structure for an individual log entry with timestamps, type of C_ operation, and other args. I just realized when writing this that each C_ function has different argument types, so we can't have one generic type, unless we have a *huge* union :) This seems to be a good case for using C++ here - abstract base log entry class for manipulation, and different actual log classes for each C_ function. It can be done without C++, with ugly pointers and casts, but it would be a pain to maintain v-tables manually for "serialize" function that will eventually convert the contain of the entry structure to text ... I vote for doing this part more cleanly and efficiently in C++ rather than C. 3) add tracing code using those log entries wherever you have your PR_LOG . There would be a linked-list of the classes of the above-mentioned log entry structure/class. The logging code would then query a thread-private pointer to get the last entry in the linked list, create a new entry, and set the last entry again in the thread-private data. The first time a PKCS#11 logging function is invoked in a thread, it would create a linked list with the first entry and set it in the thread private data as the "first" and "last" (there would be 2 thread-private keys). Upon thread termination, an NSPR thread callback is invoked on the key for nodelist' first entry, which indicates it's time to flush the log, in this case to call the "serialize" function to convert each entry to text and write it to the logfile. This algorithm could also be improved to flush the log periodically for long/large runs (eg. every n node entries, or based on a timestamp), instead of upon thread termination. It would just check for these conditions based on additional environment variables , eg. NSS_DEBUG_PKCS11_PERIOD and NSS_DEBUG_PKCS11_ENTRYLIMIT . When reaching one of these conditions, it would convert the entry to text and log it, and delete all the entries in the list. 4) change all the PR_LOG calls to prepend the name of the module and thread data (only for debug code) . This allows keeping a single logfile for non-performance traces in debug mode 5) add handling for NSS_DEBUG_MODULE="*", which causes the stub PKCS#11 functions to be inserted for all loaded PKCS#11 modules rather than for a specific one 6) add initialization code in NSS_Init* to create the two thread-private keys for nodelist start & end
I have checked in the logging code as I described it in comment #23. Reassigning to wtc.
Assignee: ian.mcgreer → wtc
I'm going to mark this bug fixed in NSS 3.6 for the work that Ian did (described in comment #23 and comment #24). If additional enhancements are desired (such as the ones Julien described in comment #1 and comment #36), please open a new enhancement request.
Status: NEW → RESOLVED
Closed: 23 years ago
Resolution: --- → FIXED
You need to log in before you can comment on or make changes to this bug.

Attachment

General

Creator:
Created:
Updated:
Size: