On Windows, when NSS is using the SQL DB and the files are on a local
file system, in about one run out of four, it ends up using the
"local cache" implementation, which is intended to be used only in cases
where the SQL DBs are in a file system to which access is quite slow,
such as over NFS or CIFS.
The effect of this is a dramatic slowdown. In my tests, which use s rather
large cert and key DB pair, the certutil -S command to generate a new self-signed cert takes 1-3 seconds when the local cache is not used, and
39-40 seconds when the local cache is used. That's a increase in time of
over 10x for using the "local cache" on the local file system.
The cause of this problem has two components.
1) the algorithm used to measure file system access time is flawed (at
least on windows). The function sdb_measureAccess always returns one of
two values, 1 or 16, returning 16 about one time in 4 or 5. When it
returns 16, the local cache is used. I believe this is due to a low
resolution clock (~60 hz) and the "fence post" problem. Most of the
time the test starts and stops within the same 1/60th of a second, causing
the time stamps for begin and end to be the same, showing a zero elapsed
time, which sdb_measureAccess converts to a one. But sometimes, it starts
in one 1/60th of a second, and ends in the next one, resulting in an apparent
elapsed time of 16 milliseconds (1/60th of a second). The problem is that
the test duration is much shorter than the resolution of the clock used to
2) The code that decides whether to use the local cache or not underestimates
the cost of the local cache. It assumes that the code with the local cache
does 4 times the disk IO of the code that does not use the local cache, and
so it multiplies the access time of the local file system by 4 before
comparing to the access time in the directory where the DBs reside. But
the actual impact of the local cache is greater than 4x, so this 4x
multiplier means that, even when the clock resolution time is fixed, the
local cache will be used even when there is no advantage to doing so, in
cases where the performance times are relatively close.
So, I propose to make two changes to try to fix this.
1) Try to fix the problem of clock resolution being coarser than the
typical elapsed test time. I propose to change sdb_measureAccess to use
PR_Now instead of PR_IntervalNow, in an attempt to improve the clock resoltion. If that doesn't work, I would propose to increase the number
of loops in sdb_measureAccess so make the test's elapsed time exceed the
clock resolution by at least a factor of 3.
2) I propose to increase the multiplier used in sdb_init from 4 to a higher
number, perhaps 10-20, to account for the higher cost of the "local cache".
Switching from PR_IntervalNow to PR_Now didn't help.
Evidently PR_Now has higher precision, but the resolution is no higher.
The problem is entirely unaffected (on windows) by this change.
Created attachment 350705 [details]
call stacks of sdb_updateCache in certutil -S
It turns out that, when using the "local cache", the VAST majority of the
CPU time is spent in sdb_updateCache. sdb_updateCache discards the old
cache and builds a new one. It gets called more than 36 times during the
course of a single "certutil -S" command. The attached file contains the
stacks of the last 36 calls to it, starting at the point where I broke in
to the certutil process with a debugger.
Each of the stacks (after the first) in the attached file has a line with
an asterisk character on it. That shows where the stack differs from the
previous stack. From these results, we can see that sdb_updateCache is
- 9 times in NSC_GenerateKeyPair (at least 9 times)
- 5 times in NSC_GetAttributeValue
- 4 times in NSC_SetAttributeValue
- 15 times in NSC_FindObjectsInit
- 2 times in NSC_CreateObject
There is code that attempts to ensure that the cache will not live longer
than 10 seconds. However, it gets "updated" (completely rebuilt) SO OFTEN
that it never lives close to 10 seconds. Perhaps it would be more efficient
if it was not rebuilt so often?
Created attachment 350707 [details] [diff] [review]
patch that works on Windows
This patch is NOT intended to be committed as is, and so is not marked for
review. If I later think we should check it in, I will mark it for
review at that time.
- uses PR_Now instead of PR_IntervalNow (which didn't help)
- increases the loop count in sdb_measureAccess from 200 to 10000, which
raises the elapsed time up to a point where it is at least twice as long
as the resolution of Windows' clock.
- Fixes sdb_getTempDirCallback to look for Windows' directory separator
character on Windows. This bug was keeping the measurement of the local
cache at zero.
- Increases the multiplier for the local cache access time from 4 to 10,
although perhaps 30-35 would be more accurate.
There are several problems with this patch, including
- the loop count in sdb_measureAccess is a constant number of repetitions,
rather than a constant amount of time. Maybe the loop should keep trace of
elapsed time and count loops until a fixed time has elapsed instead.
- There's no need to switch to PR_Now, since the increased precision does
not apparently translate into increased resolution.
Created attachment 350710 [details] [diff] [review]
patch v2 - solves all issues reported above
This patch is far better than my previous one, IMO. Bob, please review.
It measures operations in a fixed amount of time, rather than time for a
fixed number of operations. It seems to produce much more meaningful
results (on Windows) than the old code.
Although it attempts to measure for 33 Milliseconds (two Windows' ticks),
it actually ends up measuring for 3 ticks.
I can see two potential issues with this patch:
a) taking 66-96 ms at startup time might be objectionable to the browser
b) the 10x multiplier may be debateable :)
The file of stacks that I attached with comment 2 also contains some tiny
shell scripts for testing and a few other observations I made while
One of those observations is:
sdb_buildCache returns CKR_OK for all errors!
Its caller, sdb_updateCache passes the return value from sdb_buildCache
up to its caller, sdb_openDBLocal, who utterly ignores it.
Bob, wouldn't it be better if a failed attempt to build the cache caused
the sdb code to give up on the cache and stop trying to use it?
Is there a potential data loss issue because it goes on using an
incompletely constructed cache?
Comment on attachment 350710 [details] [diff] [review]
patch v2 - solves all issues reported above
also note, if someone has problems with the new algorithm choosing the wrong type, you can always force the choice with the environment variable.
If this patch adds 66-96 ms to the browser startup time as
you said in comment 4, Kai will have a hard time pushing this
patch to mozilla-central.
> in reply to comment 7,
Maybe PSM should set the environment variable to disable this test.
Also, perhaps we should have platform-dependent tests for this, e.g.
have Windows API code to ask if it's a network file system rather than
trying to measure it.
The browser already delay loads NSS, so it won't add any time to the startup.
Wan-Teh, I was surprised to find that, even when using PR_Now instead of
PR_IntervalNow, the values returned were either 0 time apart of 15-16
milliseconds apart. No values were only (say) 1 or 2 milliseconds apart.
This reinforces my long-held belief that Windows uses a 60hz clock.
But I know that I have read in numerous places that Windows is capable
of having its clock have finer resolution than that. Is there something
that must be done to enable that finer resolution in Windows?
In reply to comment 7,
> If this patch adds 66-96 ms to the browser startup time [...]
This would cause initialization of softoken to take that long when
cert9.db is in use. It would have no effect for cert8.
One good piece of news. With this bug's patch applied, NSS is MUCH faster
with key4.db than with key3.db when using large sets of private keys.
I made up pairs of cert and key DBs, one DB pair with "legacy" cert8/key3
DBs and one pair with SQLite cert9/key4 DBs. Each pair of DBs has ~12000
cert/key pairs. A simple certutil -L command takes over 30 MINUTES with
the legacy DB pair, but takes only about 4 minutes with SQLite3 DB pair.
The slowness of the old DBM-based "legacy" code is due to bug 433105.
It is good to know that the SQLite code is not similarly afflicted.
On Windows, PR_Now calls the Win32 function GetSystemTimeAsFileTime:
I don't know what is the resolution of GetSystemTimeAsFileTime
and whether it can be adjusted without privilege.
Checking in sdb.c; new revision: 1.7; previous revision: 1.6
I did some measurements on a Windows box with some network mounted drives.
I found that the new code measured 12-13 ops in 33 ms over the net, and
over 300 locally. So, the multiplier in this case was more like 25x.
A 10x multiplier should still be quite capable of discerning between
network and local file systems.