Closed Bug 121975 Opened 23 years ago Closed 2 years ago

Monitors and locks on Windows use unnecessary system calls

Categories

(NSPR :: NSPR, defect, P1)

x86
Windows 2000
defect

Tracking

(Not tracked)

RESOLVED WONTFIX

People

(Reporter: bratell, Unassigned)

References

Details

(Keywords: perf, Whiteboard: [HAS FIX])

Attachments

(1 file, 1 obsolete file)

Right now PR_EnterMonitor and PR_ExitMonitor fetches the current thread which on 
Windows goes into thread local storage and fetches a PRThread pointer. This not 
inexpensive. If the thread pointer is correct PR_Lock is called and the exact 
same PRThread pointer is fetched once again. I hacked a PR_LockForThread 
function which takes the pointer as an argument and that turned out to be ~20% 
faster.

I don't know if this is a Windows-only win but 20% is quite a lot for functions 
that must be thread safe but apart from that doesn't do much. I found this while 
looking at xpconnect overhead. The monitor overhead is as much as 70 percent for 
some functions there that can be called thousands or tens of thousands of times 
on DHTML web pages.

All measurements done with Quantify on a build from CVS today 2002-01-26.
Keywords: perf
I have received at least two reports that NSPR locking
seems slower on Windows 2000 than on Windows NT 4.0.
So maybe it is more expensive to get thread local storage
on Windows 2000?

The optimization you described is simple.  We should
give it a try.
I did try to replace the access of the thread locale storage with the win32 api
call GetCurrentThreadID but that call took almost the exact same time (~1%
faster - maybe). I presume that all the time is spent in the user space ->
kernel space transition so it doesn't matter what the actual call is.

As for differences between different Windows versions, I can't tell. I have only
this one W2k machine to test on here.

To be really fast, we would need to write some assembler
code to read the register that stores the current thread
identifier.  If you want to give that a try, you can
disassemble EnterCriticalSection() and learn from the
assembly code.
I don't think that's the solution. The whole PR_Lock thing is ~10 assembler
instructions long. One call to TlsGetValue and one call to EnterCriticalSection,
and all the time is spent down in those two functions. My suggestion was to
avoid one extra call to TlsGetValue for monitors.

*rewriting everything because NS4 lost it for me, excuse the tersity* 

I started coding just to find that the solution is already there guarded by a 
preprocessor define. If you set the environment variable USE_STATIC_TLS to 
1, NSPR will switch from Tls{Set,Get}Value to _declspec(thread) which doesn't 
cause system calls. You have to delete the nsprpub\WIN32_?.OBJ directory to get 
configure to rerun and recreate the makefiles.

   cd mozilla
   rm -rf nsprpub\WIN32_?.OBJ
   SET USE_STATIC_TLS=1
   nmake -f build_nspr

The result is that PR_EnterMonitor is ~50% faster and PR_Lock is ~25% faster. I 
run the getElementById test in bug 118933 and the times there dropped from 
~1150ms to ~1110ms. 

There is a minor downside though. According to Microsoft the _declspec(thread) 
way is not guaranteed to work if used in an DLL which is loaded manually with 
LoadLibrary. This is not the case for Mozilla and nspr4.dll (which is loaded 
automatically) but it's best to know that.

cc:ing some people who might be interested in lock performance.

jrgm, could you run the page load tests and see if they are affected by this? I 
don't know if page loads uses very much locking, with the layout engine being 
single threaded. It's mostly in DOM and JS/xpconnect code I've seen it and there 
locking can be 10-20% or even more of the time so that code will benefit.
Keywords: mozilla0.9.9
Summary: Monitors (PR_EnterMonitor and PR_ExitMonitor) can be 20% faster → Monitors and locks can on Windows can be substantially faster
Stacks, we need stacks -- who is calling PR_EnterMonitor or PR_Lock with high
frequency?  PR_Lock and PR_Unlock I can believe, from Necko and XPCOM code that
has to deal with multiple threads contending for shared data.  But since the big
change for bug 54743 went in, JS should not be using NSPR conditions or locks
for most, if not all, JS objects in Mozilla (because most objects are used by
only one thread, ever).  How is XPConnect in the picture?

/be
I was looking at DHTML performance when I ended up in lock speed. For 10000
calls to getElementById (think DHTML animation) where we are ~5 times slower
than MSIE (see bug 118933) we have the following locking:

44000 calls to PR_Lock from
js_AddRootRT()
js_AddNamedRootRT()
XPCWrappedNative::AddRef()

The calls to AddRef is done by by 
nsScriptSecurityManager::GetObjectPrincipal
XPCWrappedNative::GetNewOrUsed
XPCWrappedNative::QueryInterface

65000 calls to PR_Lock from PR_EnterMonitor. Those are done by
  22000 calls by XPCNativeInterface::GetNewOrUsed(... nsID)
                 XPCConvert::NativeInterface2JSObject
                 XPCWrappedNative::CallMethod
                  ....
   6000 calls by nsComponentManagerImpl::GetService
  23000 calls by XPCWrappedNative::GetNewOrUsed(... XPCWrappedNative**)
                 XPCConvert::NativeInterface2JSObject
                 XPCWrappedNative::CallMethod
                  ...


43000 calls to PR_Lock from 
JS_RemoveRoot
JS_RemoveRootRT
XPCWrappedNative::Release
nsCOMPtr_base::~nsCOMPtr_base(void) (and from here the stack is useless)
   probably the next is:
XPCConvert::NativeData2JS


If you want it short, it's the conversion between JS and C++ (xpconnect) that
does most of the locking and almost all of the monitor use. Especially data
conversion but also XPCConvert::NativeInterface2JSObject.


But no comment on using static tls for Windows? Regardless of how many calls
that is made to PR_Lock, we have nothing to lose by having as little overhead as
possible?

Also, there is a typo in my description above.
to build nspr: 
 nmake -f client.mak build_nspr


Am I reading this wrong, or is it true that USE_STATIC_TLS only has effect on
win95-based OS versions, while winnt-based versions decide the issue at
runtime. See:
  http://lxr.mozilla.org/mozilla/ident?i=_pr_use_static_tls
  http://lxr.mozilla.org/mozilla/search?string=USE_STATIC_TLS

Mozilla uses (and builds) NSPR for Win95 regardless of build machine. I guess
the NT version is a version only used by some Netscape server products so I
never looked very much at that code.
It is not hard to modify the "WIN95" version of NSPR
so that it uses static TLS automatically whenever it
can.

However, I think you will get bigger gain by reducing
the number of PR_EnterMonitor or PR_Lock calls.
I don't see it as one or another. I see it as both reducing the number of calls
and making the calls as fast as possible. There are other bugs about the number
of calls. Let's this one be about that we should use static tls so the calls are
as fast as possible. As far as I'm concerned it's so easy (the build flag
change) that it should be done today.
question.  The 'minor downside' you mentioned worries me a bit.  How does this
effect all of our component libraries which are loaded via LoadLibrary().  Could
you point me at the documention you are referencing?
(Ah, right. Forgot about NSPR building as win95).

Anyways, I pulled a new tree at ~9pm Jan 27 and updated a current one
to the same time (with no other changes to it). In the new tree, I did
'SET USE_STATIC_TLS=1'. With those builds, I tested the getElementById
test mentioned above, the pageload test, a window opening test and a
startup test.

Sorry to say, but I am not seeing a significant difference in the
results for the two different builds for these tests, including for the 
getElementById test. I can't explain why I can't reproduce your result 
for this test. (I did check that, at least, I have different build sizes 
for nspr4.dll. Not sure what else to check to make sure this is truly in
my build.)
The time we spend in the locking code in any of the DOM tests might still be
dwarfed by the time we spend in the security manager code. Please test again
with the patch in bug 119646, of wait until that patch has landed...
Strange. I'm definitely seeing a difference in the getElementById test. I reran
the tests several times to make sure and the gain was ~5% wall clock time all
the time. I haven't applied the CanAccess patch. My computer is a W2kSP2
computer. wtc mentioned that locking has seen slower on W2k. Could it be that
the tls{get,set}value calls only cost much there and you tested with some other OS? 

Otherwise, I can't explain why you don't see anything. You could put garbage
inside the "wrong" defines in w95thred.c to verify that the correct code is built.

dougt, read the final paragraph at
http://msdn.microsoft.com/library/default.asp?url=/library/en-us/vccore98/HTML/_core_rules_and_limitations_for_tls.asp

I've been running with the change since the day before yesterday and haven't had
a single problem with it.
I guess I do see a difference in the two runs. One thing was that I was
running this test the other night with a file:// url, where (a) it's slightly
faster than http [how come?], and (b) the difference is not so clear.

via http:// url:                          via file:// url:                      
    
A) Avg (10runs) : 2614ms / 2594 / 2625    Avg (10runs) : 2615ms / 2593 / 2640
B) Avg (10runs) : 2667ms / 2656 / 2672    Avg (10runs) : 2634ms / 2610 / 2657
     about 2% faster on average		       about 0.7% faster on average     
    

(A) is with USE_STATIC_TLS defined; (B) is with it undefined (both current
trunk)

The above runs are "typical", but I did get one run with the file:// url and
a (A) build that was ~2600 on average, and the absolute max and min of the
ranges would be larger than shown above. (That is, I discounted the
difference in the results that I saw last night when I saw the ranges
overlapping so much for the file:// tests; but with http:// tests, it is a
clearer, but narrow separation).

Note: I am running win2k "gold" [no service packs] on 500MHz/128MB.

I tried using the patch from bug 119646, but it actually doesn't build; I 
took a flail at fixing the bustage, but then when it built it crashed on the
test :-(. Oh, well.
> Avg (10runs) : 2615ms / 2593 / 2640

Note, I modified the test's reporting to showing min and max values, which is 
what is shown in the second and third time fields above.
USE_STATIC_TLS is a developer option to allow an
NSPR developer to experiment with different
implementation strategies.  Mozilla should not
build NSPR with this option.

If you want static TLS, we can modify the "WIN95"
version of NSPR to detect at run time whether it
is safe to use static TLS.  We already do this in
the "WINNT" version of NSPR.
Here is the dynamic solution almost copied straight from the NT code. Now
thread local storage will be used unless nspr.dll is loaded dynamically with an
explicit LoadLibrary call. 

As mentioned earlier this makes locking with PR_Lock ~25% faster on Windows
2000 and PR_Monitor ~50% faster. This should affect xpconnect code the most
since it uses locks all the time. The trivial getElementById function was ~3-4%
faster by using thread local storage instead of making system calls for
locking.

I would very much like to have this in Mozilla 1.0 since it's such a trivial
change (copied from existing, proven code) with measurable effect.
wtc, can you review this? I don't know the procedure to get patches into NSPR so
if I'm doing it the wrong way please tell me.
Keywords: mozilla1.0, patch
Summary: Monitors and locks can on Windows can be substantially faster → Monitors and locks on Windows use unnecessary system calls
Whiteboard: [HAS FIX]
Priority: -- → P1
Target Milestone: --- → 4.2
Blocks: 133659
Comment on attachment 76047 [details] [diff] [review]
Patch to switch strategy dynamically

This patch is correct except for one bug in w95dllmain.c.

>@@ -53,6 +55,15 @@
>         case DLL_PROCESS_ATTACH:
>             break;
>         case DLL_THREAD_ATTACH:
>+            /*
>+             * If lpvReserved is NULL, we are dynamically loaded
>+             * and therefore can't use static thread-local storage.
>+             */
>+            if (lpvReserved == NULL) {
>+                _pr_use_static_tls = FALSE;
>+            } else {
>+                _pr_use_static_tls = TRUE;
>+            }
>             break;
>         case DLL_THREAD_DETACH:
>             if (_pr_initialized) {

The new code should be added under the DLL_PROCESS_ATTACH case.
I will attach a new patch.
Attachment #76047 - Flags: needs-work+
I fixed the minor bug in DllMain.  I also adjusted the
comments and white space to match the "WINNT" version
more closely.
Attachment #76047 - Attachment is obsolete: true
Patch checked into the tip of NSPR.  I will check it
in on the NSPRPUB_PRE_4_2_CLIENT_BRANCH as well.

Marked the bug fixed.
Status: NEW → RESOLVED
Closed: 22 years ago
Resolution: --- → FIXED
Part of the "minor downside" of using static TLS is bug 135570 -- inability to
install on all Win95 and scattered Win98 systems.
So how do we fix bug 135570? Give up on the perf wins and back this out, or ship
an extra copy of nspr4.dll along with the installer for use on Win95 systems?
I *really* doubt that we will trade a 2% speed win for a download size win.  
It's not just the size, someone's got to figure out how to coerce the nspr build
system to build two versions of nspr4.dll, and then change the install code to
figure out how to use them both. Not rocket science, but not so easy it'll get
done by tomorrow's deadline as compared to backing this out so we can ship a
beta on Win95.
In light of Microsoft Knowledge Base article Q118816
(http://support.microsoft.com/default.aspx?scid=kb;EN-US;q118816),
the patch should be backed out and this bug should be resolved
WONTFIX as long as we need to support Windows 95.
Status: RESOLVED → REOPENED
Resolution: FIXED → ---
the pain!

can we think of any other way to support win95 while maintaining the performance
improvements that this fix gains us?
dougt: http://bugzilla.mozilla.org/show_bug.cgi?id=135570#c68
If I understand the problem correctly as long as the exe in question links to
nspr4.dll then the TLS will be happy - even if a later attempt is made to load
nspr4.dll (directly or indirectly) via LoadLibrary. Whether or not this is an
acceptable solution is debatable.
Doug, John,

NSPR needs to support more than the Mozilla client and its
installer.  This patch breaks any NSPR client that loads
nspr4.dll with a LoadLibrary call on Windows 95.  As long
as NSPR needs to support Windows 95, this is not acceptable.
If we back out the patch it looks like this is controlled by a compiler flag. We
could tickle the build scripts to produce two versions of the nspr library on
windows. Mozilla could use the faster version, and the original version could be
used by the installer and any other embedding client that dynamically loads the
library.

Once we create the Mozilla Runtime Environment that'll mean all embeddors,
right?. Apps will look up the centrally-installed correct version of the MRE in
the registry and use LoadLibrary() to get it going.

The first step is to back out this patch and send the performance gurus back to
the drawing board on how to get the faster second version out of the build system.
FWIW sr=jband on backing out the change.
a=dbaron on backing out on 1.0 branch
Blocks: 135570
The fix has been backed out of the tip, NSPRPUB_PRE_4_2_CLIENT_BRANCH,
and MOZILLA_1_0_0_BRANCH of NSPR.  Marked the bug WONTFIX.

The fix can be checked back in when we don't need to support Windows 95.
Status: REOPENED → RESOLVED
Closed: 22 years ago22 years ago
Resolution: --- → WONTFIX
Reopening. The issue is still there. 

Now that we know that the way recommended in comment 18 breaks Win95, I would
like to try the "Two types of NSPR"-way with the USE_STATIC_TLS compiler flag as
was dicussed at the start. This really affects lock heavy code. 

And even if we don't want to change that, there's still the issue of
PR_{Enter,Exit}Monitor calling TlsGet twice to get the same value. Once in its
own function and once in PR_Lock.
Status: RESOLVED → REOPENED
Resolution: WONTFIX → ---
QA Contact: wtchang → nspr
The target milestone is already released. Resetting target milestone.
Target Milestone: 4.2 → ---
(In reply to comment #35)
> The fix has been backed out of the tip, NSPRPUB_PRE_4_2_CLIENT_BRANCH,
> and MOZILLA_1_0_0_BRANCH of NSPR.  Marked the bug WONTFIX.
> 
> The fix can be checked back in when we don't need to support Windows 95.

now that win9x is dead...
Status: REOPENED → NEW

The bug assignee didn't login in Bugzilla in the last 7 months.
:KaiE, could you have a look please?
For more information, please visit auto_nag documentation.

Assignee: wtc → nobody
Flags: needinfo?(kaie)
Status: NEW → RESOLVED
Closed: 22 years ago2 years ago
Flags: needinfo?(kaie)
Resolution: --- → WONTFIX
You need to log in before you can comment on or make changes to this bug.

Attachment

General

Creator:
Created:
Updated:
Size: