On Windows, when NSS is using the SQL DB and the files are on a local file system, in about one run out of four, it ends up using the "local cache" implementation, which is intended to be used only in cases where the SQL DBs are in a file system to which access is quite slow, such as over NFS or CIFS. The effect of this is a dramatic slowdown. In my tests, which use a rather large cert and key DB pair, the certutil -S command to generate a new self-signed cert takes 1-3 seconds when the local cache is not used, and 39-40 seconds when it is. That's an increase in time of over 10x for using the "local cache" on the local file system.

The cause of this problem has two components.

1) The algorithm used to measure file system access time is flawed (at least on Windows). The function sdb_measureAccess always returns one of two values, 1 or 16, returning 16 about one time in 4 or 5. When it returns 16, the local cache is used. I believe this is due to a low resolution clock (~60 Hz) and the "fence post" problem. Most of the time the test starts and stops within the same 1/60th of a second, causing the time stamps for begin and end to be the same, showing a zero elapsed time, which sdb_measureAccess converts to a one. But sometimes it starts in one 1/60th of a second and ends in the next, resulting in an apparent elapsed time of 16 milliseconds (1/60th of a second). The problem is that the test duration is much shorter than the resolution of the clock used to measure it.

2) The code that decides whether to use the local cache underestimates the cost of the local cache. It assumes that the code with the local cache does 4 times the disk IO of the code that does not use it, and so it multiplies the access time of the local file system by 4 before comparing to the access time in the directory where the DBs reside.
But the actual impact of the local cache is greater than 4x, so this 4x multiplier means that, even when the clock resolution problem is fixed, the local cache will be used even when there is no advantage to doing so, in cases where the performance times are relatively close.

So, I propose to make two changes to try to fix this.

1) Try to fix the problem of clock resolution being coarser than the typical elapsed test time. I propose to change sdb_measureAccess to use PR_Now instead of PR_IntervalNow, in an attempt to improve the clock resolution. If that doesn't work, I would propose to increase the number of loops in sdb_measureAccess to make the test's elapsed time exceed the clock resolution by at least a factor of 3.

2) I propose to increase the multiplier used in sdb_init from 4 to a higher number, perhaps 10-20, to account for the higher cost of the "local cache".
Switching from PR_IntervalNow to PR_Now didn't help. Evidently PR_Now has higher precision, but the resolution is no higher. The problem is entirely unaffected (on windows) by this change.
Created attachment 350705 [details] call stacks of sdb_updateCache in certutil -S

It turns out that, when using the "local cache", the VAST majority of the CPU time is spent in sdb_updateCache. sdb_updateCache discards the old cache and builds a new one. It gets called more than 36 times during the course of a single "certutil -S" command. The attached file contains the stacks of the last 36 calls to it, starting at the point where I broke in to the certutil process with a debugger. Each of the stacks (after the first) in the attached file has a line with an asterisk character on it. That shows where the stack differs from the previous stack.

From these results, we can see that sdb_updateCache is called:
- at least 9 times in NSC_GenerateKeyPair
- 5 times in NSC_GetAttributeValue
- 4 times in NSC_SetAttributeValue
- 15 times in NSC_FindObjectsInit
- 2 times in NSC_CreateObject

There is code that attempts to ensure that the cache will not live longer than 10 seconds. However, it gets "updated" (completely rebuilt) SO OFTEN that it never lives close to 10 seconds. Perhaps it would be more efficient if it was not rebuilt so often?
Created attachment 350707 [details] [diff] [review] patch that works on Windows

This patch is NOT intended to be committed as is, and so is not marked for review. If I later think we should check it in, I will mark it for review at that time.

This patch
- uses PR_Now instead of PR_IntervalNow (which didn't help)
- increases the loop count in sdb_measureAccess from 200 to 10000, which raises the elapsed time to a point where it is at least twice as long as the resolution of Windows' clock
- fixes sdb_getTempDirCallback to look for Windows' directory separator character on Windows. This bug was keeping the measurement of the local cache at zero.
- increases the multiplier for the local cache access time from 4 to 10, although perhaps 30-35 would be more accurate

There are several problems with this patch, including
- the loop count in sdb_measureAccess is a constant number of repetitions, rather than a constant amount of time. Maybe the loop should keep track of elapsed time and count loops until a fixed time has elapsed instead.
- there's no need to switch to PR_Now, since the increased precision does not apparently translate into increased resolution.
Created attachment 350710 [details] [diff] [review] patch v2 - solves all issues reported above

This patch is far better than my previous one, IMO. Bob, please review. It measures operations in a fixed amount of time, rather than time for a fixed number of operations. It seems to produce much more meaningful results (on Windows) than the old code. Although it attempts to measure for 33 milliseconds (two Windows ticks), it actually ends up measuring for 3 ticks.

I can see two potential issues with this patch:
a) taking 66-96 ms at startup time might be objectionable to the browser folks.
b) the 10x multiplier may be debatable :)
The file of stacks that I attached with comment 2 also contains some tiny shell scripts for testing and a few other observations I made while debugging. One of those observations is: sdb_buildCache returns CKR_OK for all errors! Its caller, sdb_updateCache, passes the return value from sdb_buildCache up to its caller, sdb_openDBLocal, which utterly ignores it. Bob, wouldn't it be better if a failed attempt to build the cache caused the sdb code to give up on the cache and stop trying to use it? Is there a potential data loss issue because it goes on using an incompletely constructed cache?
Comment on attachment 350710 [details] [diff] [review] patch v2 - solves all issues reported above r+ also note, if someone has problems with the new algorithm choosing the wrong type, you can always force the choice with the environment variable. bob
If this patch adds 66-96 ms to the browser startup time as you said in comment 4, Kai will have a hard time pushing this patch to mozilla-central.
In reply to comment 7: maybe PSM should set the environment variable to disable this test.
Also, perhaps we should have platform-dependent tests for this, e.g. have Windows API code to ask if it's a network file system rather than trying to measure it.
The browser already delay loads NSS, so it won't add any time to the startup. bob
Wan-Teh, I was surprised to find that, even when using PR_Now instead of PR_IntervalNow, the values returned were either 0 or 15-16 milliseconds apart. No values were only (say) 1 or 2 milliseconds apart. This reinforces my long-held belief that Windows uses a 60 Hz clock. But I know that I have read in numerous places that Windows is capable of having its clock have finer resolution than that. Is there something that must be done to enable that finer resolution in Windows?
In reply to comment 7, > If this patch adds 66-96 ms to the browser startup time [...] This would cause initialization of softoken to take that long when cert9.db is in use. It would have no effect for cert8.
One good piece of news. With this bug's patch applied, NSS is MUCH faster with key4.db than with key3.db when using large sets of private keys. I made up pairs of cert and key DBs, one DB pair with "legacy" cert8/key3 DBs and one pair with SQLite cert9/key4 DBs. Each pair of DBs has ~12000 cert/key pairs. A simple certutil -L command takes over 30 MINUTES with the legacy DB pair, but takes only about 4 minutes with the SQLite DB pair. The slowness of the old DBM-based "legacy" code is due to bug 433105. It is good to know that the SQLite code is not similarly afflicted.
On Windows, PR_Now calls the Win32 function GetSystemTimeAsFileTime: http://msdn.microsoft.com/en-us/library/ms724397(VS.85).aspx I don't know what the resolution of GetSystemTimeAsFileTime is, or whether it can be adjusted without privilege.
Checking in sdb.c; new revision: 1.7; previous revision: 1.6 I did some measurements on a Windows box with some network mounted drives. I found that the new code measured 12-13 ops in 33 ms over the net, and over 300 locally. So, the multiplier in this case was more like 25x. A 10x multiplier should still be quite capable of discerning between network and local file systems.