Closed
Bug 121975
Opened 24 years ago
Closed 3 years ago
Monitors and locks on Windows use unnecessary system calls
Categories
(NSPR :: NSPR, defect, P1)
Tracking
(Not tracked)
RESOLVED
WONTFIX
People
(Reporter: bratell, Unassigned)
References
Details
(Keywords: perf, Whiteboard: [HAS FIX])
Attachments
(1 file, 1 obsolete file)
8.00 KB,
patch
|
Details | Diff | Splinter Review |
Right now PR_EnterMonitor and PR_ExitMonitor fetches the current thread which on
Windows goes into thread local storage and fetches a PRThread pointer. This not
inexpensive. If the thread pointer is correct PR_Lock is called and the exact
same PRThread pointer is fetched once again. I hacked a PR_LockForThread
function which takes the pointer as an argument and that turned out to be ~20%
faster.
I don't know if this is a Windows-only win but 20% is quite a lot for functions
that must be thread safe but apart from that doesn't do much. I found this while
looking at xpconnect overhead. The monitor overhead is as much as 70 percent for
some functions there that can be called thousands or tens of thousands of times
on DHTML web pages.
All measurements done with Quantify on a build from CVS today 2002-01-26.
Comment 1•24 years ago
|
||
I have received at least two reports that NSPR locking
seems slower on Windows 2000 than on Windows NT 4.0.
So maybe it is more expensive to get thread local storage
on Windows 2000?
The optimization you described is simple. We should
give it a try.
Reporter | ||
Comment 2•24 years ago
|
||
I did try to replace the access of the thread locale storage with the win32 api
call GetCurrentThreadID but that call took almost the exact same time (~1%
faster - maybe). I presume that all the time is spent in the user space ->
kernel space transition so it doesn't matter what the actual call is.
As for differences between different Windows versions, I can't tell. I have only
this one W2k machine to test on here.
Comment 3•24 years ago
|
||
To be really fast, we would need to write some assembler
code to read the register that stores the current thread
identifier. If you want to give that a try, you can
disassemble EnterCriticalSection() and learn from the
assembly code.
Reporter | ||
Comment 4•24 years ago
|
||
I don't think that's the solution. The whole PR_Lock thing is ~10 assembler
instructions long. One call to TlsGetValue and one call to EnterCriticalSection,
and all the time is spent down in those two functions. My suggestion was to
avoid one extra call to TlsGetValue for monitors.
Reporter | ||
Comment 5•24 years ago
|
||
*rewriting everything because NS4 lost it for me, excuse the tersity*
I started coding just to find that the solution is already there guarded by a
preprocessor define. If you set the environment variable USE_STATIC_TLS to
1, NSPR will switch from Tls{Set,Get}Value to _declspec(thread) which doesn't
cause system calls. You have to delete the nsprpub\WIN32_?.OBJ directory to get
configure to rerun and recreate the makefiles.
cd mozilla
rm -rf nsprpub\WIN32_?.OBJ
SET USE_STATIC_TLS=1
nmake -f build_nspr
The result is that PR_EnterMonitor is ~50% faster and PR_Lock is ~25% faster. I
run the getElementById test in bug 118933 and the times there dropped from
~1150ms to ~1110ms.
There is a minor downside though. According to Microsoft the _declspec(thread)
way is not guaranteed to work if used in an DLL which is loaded manually with
LoadLibrary. This is not the case for Mozilla and nspr4.dll (which is loaded
automatically) but it's best to know that.
cc:ing some people who might be interested in lock performance.
jrgm, could you run the page load tests and see if they are affected by this? I
don't know if page loads uses very much locking, with the layout engine being
single threaded. It's mostly in DOM and JS/xpconnect code I've seen it and there
locking can be 10-20% or even more of the time so that code will benefit.
Keywords: mozilla0.9.9
Summary: Monitors (PR_EnterMonitor and PR_ExitMonitor) can be 20% faster → Monitors and locks can on Windows can be substantially faster
Comment 6•24 years ago
|
||
Stacks, we need stacks -- who is calling PR_EnterMonitor or PR_Lock with high
frequency? PR_Lock and PR_Unlock I can believe, from Necko and XPCOM code that
has to deal with multiple threads contending for shared data. But since the big
change for bug 54743 went in, JS should not be using NSPR conditions or locks
for most, if not all, JS objects in Mozilla (because most objects are used by
only one thread, ever). How is XPConnect in the picture?
/be
Reporter | ||
Comment 7•24 years ago
|
||
I was looking at DHTML performance when I ended up in lock speed. For 10000
calls to getElementById (think DHTML animation) where we are ~5 times slower
than MSIE (see bug 118933) we have the following locking:
44000 calls to PR_Lock from
js_AddRootRT()
js_AddNamedRootRT()
XPCWrappedNative::AddRef()
The calls to AddRef is done by by
nsScriptSecurityManager::GetObjectPrincipal
XPCWrappedNative::GetNewOrUsed
XPCWrappedNative::QueryInterface
65000 calls to PR_Lock from PR_EnterMonitor. Those are done by
22000 calls by XPCNativeInterface::GetNewOrUsed(... nsID)
XPCConvert::NativeInterface2JSObject
XPCWrappedNative::CallMethod
....
6000 calls by nsComponentManagerImpl::GetService
23000 calls by XPCWrappedNative::GetNewOrUsed(... XPCWrappedNative**)
XPCConvert::NativeInterface2JSObject
XPCWrappedNative::CallMethod
...
43000 calls to PR_Lock from
JS_RemoveRoot
JS_RemoveRootRT
XPCWrappedNative::Release
nsCOMPtr_base::~nsCOMPtr_base(void) (and from here the stack is useless)
probably the next is:
XPCConvert::NativeData2JS
If you want it short, it's the conversion between JS and C++ (xpconnect) that
does most of the locking and almost all of the monitor use. Especially data
conversion but also XPCConvert::NativeInterface2JSObject.
But no comment on using static tls for Windows? Regardless of how many calls
that is made to PR_Lock, we have nothing to lose by having as little overhead as
possible?
Also, there is a typo in my description above.
to build nspr:
nmake -f client.mak build_nspr
Comment 8•24 years ago
|
||
Am I reading this wrong, or is it true that USE_STATIC_TLS only has effect on
win95-based OS versions, while winnt-based versions decide the issue at
runtime. See:
http://lxr.mozilla.org/mozilla/ident?i=_pr_use_static_tls
http://lxr.mozilla.org/mozilla/search?string=USE_STATIC_TLS
Reporter | ||
Comment 9•24 years ago
|
||
Mozilla uses (and builds) NSPR for Win95 regardless of build machine. I guess
the NT version is a version only used by some Netscape server products so I
never looked very much at that code.
Comment 10•24 years ago
|
||
It is not hard to modify the "WIN95" version of NSPR
so that it uses static TLS automatically whenever it
can.
However, I think you will get bigger gain by reducing
the number of PR_EnterMonitor or PR_Lock calls.
Reporter | ||
Comment 11•24 years ago
|
||
I don't see it as one or another. I see it as both reducing the number of calls
and making the calls as fast as possible. There are other bugs about the number
of calls. Let's this one be about that we should use static tls so the calls are
as fast as possible. As far as I'm concerned it's so easy (the build flag
change) that it should be done today.
Comment 12•24 years ago
|
||
question. The 'minor downside' you mentioned worries me a bit. How does this
effect all of our component libraries which are loaded via LoadLibrary(). Could
you point me at the documention you are referencing?
Comment 13•24 years ago
|
||
(Ah, right. Forgot about NSPR building as win95).
Anyways, I pulled a new tree at ~9pm Jan 27 and updated a current one
to the same time (with no other changes to it). In the new tree, I did
'SET USE_STATIC_TLS=1'. With those builds, I tested the getElementById
test mentioned above, the pageload test, a window opening test and a
startup test.
Sorry to say, but I am not seeing a significant difference in the
results for the two different builds for these tests, including for the
getElementById test. I can't explain why I can't reproduce your result
for this test. (I did check that, at least, I have different build sizes
for nspr4.dll. Not sure what else to check to make sure this is truly in
my build.)
Comment 14•24 years ago
|
||
The time we spend in the locking code in any of the DOM tests might still be
dwarfed by the time we spend in the security manager code. Please test again
with the patch in bug 119646, of wait until that patch has landed...
Reporter | ||
Comment 15•24 years ago
|
||
Strange. I'm definitely seeing a difference in the getElementById test. I reran
the tests several times to make sure and the gain was ~5% wall clock time all
the time. I haven't applied the CanAccess patch. My computer is a W2kSP2
computer. wtc mentioned that locking has seen slower on W2k. Could it be that
the tls{get,set}value calls only cost much there and you tested with some other OS?
Otherwise, I can't explain why you don't see anything. You could put garbage
inside the "wrong" defines in w95thred.c to verify that the correct code is built.
dougt, read the final paragraph at
http://msdn.microsoft.com/library/default.asp?url=/library/en-us/vccore98/HTML/_core_rules_and_limitations_for_tls.asp
I've been running with the change since the day before yesterday and haven't had
a single problem with it.
Comment 16•24 years ago
|
||
I guess I do see a difference in the two runs. One thing was that I was
running this test the other night with a file:// url, where (a) it's slightly
faster than http [how come?], and (b) the difference is not so clear.
via http:// url: via file:// url:
A) Avg (10runs) : 2614ms / 2594 / 2625 Avg (10runs) : 2615ms / 2593 / 2640
B) Avg (10runs) : 2667ms / 2656 / 2672 Avg (10runs) : 2634ms / 2610 / 2657
about 2% faster on average about 0.7% faster on average
(A) is with USE_STATIC_TLS defined; (B) is with it undefined (both current
trunk)
The above runs are "typical", but I did get one run with the file:// url and
a (A) build that was ~2600 on average, and the absolute max and min of the
ranges would be larger than shown above. (That is, I discounted the
difference in the results that I saw last night when I saw the ranges
overlapping so much for the file:// tests; but with http:// tests, it is a
clearer, but narrow separation).
Note: I am running win2k "gold" [no service packs] on 500MHz/128MB.
I tried using the patch from bug 119646, but it actually doesn't build; I
took a flail at fixing the bustage, but then when it built it crashed on the
test :-(. Oh, well.
Comment 17•24 years ago
|
||
> Avg (10runs) : 2615ms / 2593 / 2640
Note, I modified the test's reporting to showing min and max values, which is
what is shown in the second and third time fields above.
Comment 18•24 years ago
|
||
USE_STATIC_TLS is a developer option to allow an
NSPR developer to experiment with different
implementation strategies. Mozilla should not
build NSPR with this option.
If you want static TLS, we can modify the "WIN95"
version of NSPR to detect at run time whether it
is safe to use static TLS. We already do this in
the "WINNT" version of NSPR.
Reporter | ||
Comment 19•23 years ago
|
||
Here is the dynamic solution almost copied straight from the NT code. Now
thread local storage will be used unless nspr.dll is loaded dynamically with an
explicit LoadLibrary call.
As mentioned earlier this makes locking with PR_Lock ~25% faster on Windows
2000 and PR_Monitor ~50% faster. This should affect xpconnect code the most
since it uses locks all the time. The trivial getElementById function was ~3-4%
faster by using thread local storage instead of making system calls for
locking.
I would very much like to have this in Mozilla 1.0 since it's such a trivial
change (copied from existing, proven code) with measurable effect.
Reporter | ||
Comment 20•23 years ago
|
||
wtc, can you review this? I don't know the procedure to get patches into NSPR so
if I'm doing it the wrong way please tell me.
Keywords: mozilla1.0,
patch
Summary: Monitors and locks can on Windows can be substantially faster → Monitors and locks on Windows use unnecessary system calls
Whiteboard: [HAS FIX]
Updated•23 years ago
|
Priority: -- → P1
Target Milestone: --- → 4.2
Comment 21•23 years ago
|
||
Comment on attachment 76047 [details] [diff] [review]
Patch to switch strategy dynamically
This patch is correct except for one bug in w95dllmain.c.
>@@ -53,6 +55,15 @@
> case DLL_PROCESS_ATTACH:
> break;
> case DLL_THREAD_ATTACH:
>+ /*
>+ * If lpvReserved is NULL, we are dynamically loaded
>+ * and therefore can't use static thread-local storage.
>+ */
>+ if (lpvReserved == NULL) {
>+ _pr_use_static_tls = FALSE;
>+ } else {
>+ _pr_use_static_tls = TRUE;
>+ }
> break;
> case DLL_THREAD_DETACH:
> if (_pr_initialized) {
The new code should be added under the DLL_PROCESS_ATTACH case.
I will attach a new patch.
Attachment #76047 -
Flags: needs-work+
Comment 22•23 years ago
|
||
I fixed the minor bug in DllMain. I also adjusted the
comments and white space to match the "WINNT" version
more closely.
Attachment #76047 -
Attachment is obsolete: true
Comment 23•23 years ago
|
||
Patch checked into the tip of NSPR. I will check it
in on the NSPRPUB_PRE_4_2_CLIENT_BRANCH as well.
Marked the bug fixed.
Status: NEW → RESOLVED
Closed: 23 years ago
Resolution: --- → FIXED
Comment 24•23 years ago
|
||
Part of the "minor downside" of using static TLS is bug 135570 -- inability to
install on all Win95 and scattered Win98 systems.
Comment 25•23 years ago
|
||
So how do we fix bug 135570? Give up on the perf wins and back this out, or ship
an extra copy of nspr4.dll along with the installer for use on Win95 systems?
Comment 26•23 years ago
|
||
I *really* doubt that we will trade a 2% speed win for a download size win.
Comment 27•23 years ago
|
||
It's not just the size, someone's got to figure out how to coerce the nspr build
system to build two versions of nspr4.dll, and then change the install code to
figure out how to use them both. Not rocket science, but not so easy it'll get
done by tomorrow's deadline as compared to backing this out so we can ship a
beta on Win95.
Comment 28•23 years ago
|
||
In light of Microsoft Knowledge Base article Q118816
(http://support.microsoft.com/default.aspx?scid=kb;EN-US;q118816),
the patch should be backed out and this bug should be resolved
WONTFIX as long as we need to support Windows 95.
Status: RESOLVED → REOPENED
Resolution: FIXED → ---
Comment 29•23 years ago
|
||
the pain!
can we think of any other way to support win95 while maintaining the performance
improvements that this fix gains us?
Comment 30•23 years ago
|
||
dougt: http://bugzilla.mozilla.org/show_bug.cgi?id=135570#c68
If I understand the problem correctly as long as the exe in question links to
nspr4.dll then the TLS will be happy - even if a later attempt is made to load
nspr4.dll (directly or indirectly) via LoadLibrary. Whether or not this is an
acceptable solution is debatable.
Comment 31•23 years ago
|
||
Doug, John,
NSPR needs to support more than the Mozilla client and its
installer. This patch breaks any NSPR client that loads
nspr4.dll with a LoadLibrary call on Windows 95. As long
as NSPR needs to support Windows 95, this is not acceptable.
Comment 32•23 years ago
|
||
If we back out the patch it looks like this is controlled by a compiler flag. We
could tickle the build scripts to produce two versions of the nspr library on
windows. Mozilla could use the faster version, and the original version could be
used by the installer and any other embedding client that dynamically loads the
library.
Once we create the Mozilla Runtime Environment that'll mean all embeddors,
right?. Apps will look up the centrally-installed correct version of the MRE in
the registry and use LoadLibrary() to get it going.
The first step is to back out this patch and send the performance gurus back to
the drawing board on how to get the faster second version out of the build system.
Comment 33•23 years ago
|
||
FWIW sr=jband on backing out the change.
a=dbaron on backing out on 1.0 branch
Comment 35•23 years ago
|
||
The fix has been backed out of the tip, NSPRPUB_PRE_4_2_CLIENT_BRANCH,
and MOZILLA_1_0_0_BRANCH of NSPR. Marked the bug WONTFIX.
The fix can be checked back in when we don't need to support Windows 95.
Status: REOPENED → RESOLVED
Closed: 23 years ago → 23 years ago
Resolution: --- → WONTFIX
Reporter | ||
Comment 36•23 years ago
|
||
Reopening. The issue is still there.
Now that we know that the way recommended in comment 18 breaks Win95, I would
like to try the "Two types of NSPR"-way with the USE_STATIC_TLS compiler flag as
was dicussed at the start. This really affects lock heavy code.
And even if we don't want to change that, there's still the issue of
PR_{Enter,Exit}Monitor calling TlsGet twice to get the same value. Once in its
own function and once in PR_Lock.
Status: RESOLVED → REOPENED
Resolution: WONTFIX → ---
Updated•19 years ago
|
QA Contact: wtchang → nspr
Comment 37•18 years ago
|
||
The target milestone is already released. Resetting target milestone.
Target Milestone: 4.2 → ---
Comment 38•17 years ago
|
||
(In reply to comment #35)
> The fix has been backed out of the tip, NSPRPUB_PRE_4_2_CLIENT_BRANCH,
> and MOZILLA_1_0_0_BRANCH of NSPR. Marked the bug WONTFIX.
>
> The fix can be checked back in when we don't need to support Windows 95.
now that win9x is dead...
Updated•17 years ago
|
Status: REOPENED → NEW
Comment 39•3 years ago
|
||
The bug assignee didn't login in Bugzilla in the last 7 months.
:KaiE, could you have a look please?
For more information, please visit auto_nag documentation.
Assignee: wtc → nobody
Flags: needinfo?(kaie)
Updated•3 years ago
|
Status: NEW → RESOLVED
Closed: 23 years ago → 3 years ago
Flags: needinfo?(kaie)
Resolution: --- → WONTFIX
You need to log in
before you can comment on or make changes to this bug.
Description
•