Closed Bug 121975 Opened 24 years ago Closed 3 years ago

Monitors and locks on Windows use unnecessary system calls

Categories

(NSPR :: NSPR, defect, P1)

x86
Windows 2000
defect

Tracking

(Not tracked)

RESOLVED WONTFIX

People

(Reporter: bratell, Unassigned)

References

Details

(Keywords: perf, Whiteboard: [HAS FIX])

Attachments

(1 file, 1 obsolete file)

Right now PR_EnterMonitor and PR_ExitMonitor fetches the current thread which on Windows goes into thread local storage and fetches a PRThread pointer. This not inexpensive. If the thread pointer is correct PR_Lock is called and the exact same PRThread pointer is fetched once again. I hacked a PR_LockForThread function which takes the pointer as an argument and that turned out to be ~20% faster. I don't know if this is a Windows-only win but 20% is quite a lot for functions that must be thread safe but apart from that doesn't do much. I found this while looking at xpconnect overhead. The monitor overhead is as much as 70 percent for some functions there that can be called thousands or tens of thousands of times on DHTML web pages. All measurements done with Quantify on a build from CVS today 2002-01-26.
Keywords: perf
I have received at least two reports that NSPR locking seems slower on Windows 2000 than on Windows NT 4.0. So maybe it is more expensive to get thread local storage on Windows 2000? The optimization you described is simple. We should give it a try.
I did try to replace the access of the thread locale storage with the win32 api call GetCurrentThreadID but that call took almost the exact same time (~1% faster - maybe). I presume that all the time is spent in the user space -> kernel space transition so it doesn't matter what the actual call is. As for differences between different Windows versions, I can't tell. I have only this one W2k machine to test on here.
To be really fast, we would need to write some assembler code to read the register that stores the current thread identifier. If you want to give that a try, you can disassemble EnterCriticalSection() and learn from the assembly code.
I don't think that's the solution. The whole PR_Lock thing is ~10 assembler instructions long. One call to TlsGetValue and one call to EnterCriticalSection, and all the time is spent down in those two functions. My suggestion was to avoid one extra call to TlsGetValue for monitors.
*rewriting everything because NS4 lost it for me, excuse the tersity* I started coding just to find that the solution is already there guarded by a preprocessor define. If you set the environment variable USE_STATIC_TLS to 1, NSPR will switch from Tls{Set,Get}Value to _declspec(thread) which doesn't cause system calls. You have to delete the nsprpub\WIN32_?.OBJ directory to get configure to rerun and recreate the makefiles. cd mozilla rm -rf nsprpub\WIN32_?.OBJ SET USE_STATIC_TLS=1 nmake -f build_nspr The result is that PR_EnterMonitor is ~50% faster and PR_Lock is ~25% faster. I run the getElementById test in bug 118933 and the times there dropped from ~1150ms to ~1110ms. There is a minor downside though. According to Microsoft the _declspec(thread) way is not guaranteed to work if used in an DLL which is loaded manually with LoadLibrary. This is not the case for Mozilla and nspr4.dll (which is loaded automatically) but it's best to know that. cc:ing some people who might be interested in lock performance. jrgm, could you run the page load tests and see if they are affected by this? I don't know if page loads uses very much locking, with the layout engine being single threaded. It's mostly in DOM and JS/xpconnect code I've seen it and there locking can be 10-20% or even more of the time so that code will benefit.
Keywords: mozilla0.9.9
Summary: Monitors (PR_EnterMonitor and PR_ExitMonitor) can be 20% faster → Monitors and locks can on Windows can be substantially faster
Stacks, we need stacks -- who is calling PR_EnterMonitor or PR_Lock with high frequency? PR_Lock and PR_Unlock I can believe, from Necko and XPCOM code that has to deal with multiple threads contending for shared data. But since the big change for bug 54743 went in, JS should not be using NSPR conditions or locks for most, if not all, JS objects in Mozilla (because most objects are used by only one thread, ever). How is XPConnect in the picture? /be
I was looking at DHTML performance when I ended up in lock speed. For 10000 calls to getElementById (think DHTML animation) where we are ~5 times slower than MSIE (see bug 118933) we have the following locking: 44000 calls to PR_Lock from js_AddRootRT() js_AddNamedRootRT() XPCWrappedNative::AddRef() The calls to AddRef is done by by nsScriptSecurityManager::GetObjectPrincipal XPCWrappedNative::GetNewOrUsed XPCWrappedNative::QueryInterface 65000 calls to PR_Lock from PR_EnterMonitor. Those are done by 22000 calls by XPCNativeInterface::GetNewOrUsed(... nsID) XPCConvert::NativeInterface2JSObject XPCWrappedNative::CallMethod .... 6000 calls by nsComponentManagerImpl::GetService 23000 calls by XPCWrappedNative::GetNewOrUsed(... XPCWrappedNative**) XPCConvert::NativeInterface2JSObject XPCWrappedNative::CallMethod ... 43000 calls to PR_Lock from JS_RemoveRoot JS_RemoveRootRT XPCWrappedNative::Release nsCOMPtr_base::~nsCOMPtr_base(void) (and from here the stack is useless) probably the next is: XPCConvert::NativeData2JS If you want it short, it's the conversion between JS and C++ (xpconnect) that does most of the locking and almost all of the monitor use. Especially data conversion but also XPCConvert::NativeInterface2JSObject. But no comment on using static tls for Windows? Regardless of how many calls that is made to PR_Lock, we have nothing to lose by having as little overhead as possible? Also, there is a typo in my description above. to build nspr: nmake -f client.mak build_nspr
Am I reading this wrong, or is it true that USE_STATIC_TLS only has effect on win95-based OS versions, while winnt-based versions decide the issue at runtime. See: http://lxr.mozilla.org/mozilla/ident?i=_pr_use_static_tls http://lxr.mozilla.org/mozilla/search?string=USE_STATIC_TLS
Mozilla uses (and builds) NSPR for Win95 regardless of build machine. I guess the NT version is a version only used by some Netscape server products so I never looked very much at that code.
It is not hard to modify the "WIN95" version of NSPR so that it uses static TLS automatically whenever it can. However, I think you will get bigger gain by reducing the number of PR_EnterMonitor or PR_Lock calls.
I don't see it as one or another. I see it as both reducing the number of calls and making the calls as fast as possible. There are other bugs about the number of calls. Let's this one be about that we should use static tls so the calls are as fast as possible. As far as I'm concerned it's so easy (the build flag change) that it should be done today.
question. The 'minor downside' you mentioned worries me a bit. How does this effect all of our component libraries which are loaded via LoadLibrary(). Could you point me at the documention you are referencing?
(Ah, right. Forgot about NSPR building as win95). Anyways, I pulled a new tree at ~9pm Jan 27 and updated a current one to the same time (with no other changes to it). In the new tree, I did 'SET USE_STATIC_TLS=1'. With those builds, I tested the getElementById test mentioned above, the pageload test, a window opening test and a startup test. Sorry to say, but I am not seeing a significant difference in the results for the two different builds for these tests, including for the getElementById test. I can't explain why I can't reproduce your result for this test. (I did check that, at least, I have different build sizes for nspr4.dll. Not sure what else to check to make sure this is truly in my build.)
The time we spend in the locking code in any of the DOM tests might still be dwarfed by the time we spend in the security manager code. Please test again with the patch in bug 119646, of wait until that patch has landed...
Strange. I'm definitely seeing a difference in the getElementById test. I reran the tests several times to make sure and the gain was ~5% wall clock time all the time. I haven't applied the CanAccess patch. My computer is a W2kSP2 computer. wtc mentioned that locking has seen slower on W2k. Could it be that the tls{get,set}value calls only cost much there and you tested with some other OS? Otherwise, I can't explain why you don't see anything. You could put garbage inside the "wrong" defines in w95thred.c to verify that the correct code is built. dougt, read the final paragraph at http://msdn.microsoft.com/library/default.asp?url=/library/en-us/vccore98/HTML/_core_rules_and_limitations_for_tls.asp I've been running with the change since the day before yesterday and haven't had a single problem with it.
I guess I do see a difference in the two runs. One thing was that I was running this test the other night with a file:// url, where (a) it's slightly faster than http [how come?], and (b) the difference is not so clear. via http:// url: via file:// url: A) Avg (10runs) : 2614ms / 2594 / 2625 Avg (10runs) : 2615ms / 2593 / 2640 B) Avg (10runs) : 2667ms / 2656 / 2672 Avg (10runs) : 2634ms / 2610 / 2657 about 2% faster on average about 0.7% faster on average (A) is with USE_STATIC_TLS defined; (B) is with it undefined (both current trunk) The above runs are "typical", but I did get one run with the file:// url and a (A) build that was ~2600 on average, and the absolute max and min of the ranges would be larger than shown above. (That is, I discounted the difference in the results that I saw last night when I saw the ranges overlapping so much for the file:// tests; but with http:// tests, it is a clearer, but narrow separation). Note: I am running win2k "gold" [no service packs] on 500MHz/128MB. I tried using the patch from bug 119646, but it actually doesn't build; I took a flail at fixing the bustage, but then when it built it crashed on the test :-(. Oh, well.
> Avg (10runs) : 2615ms / 2593 / 2640 Note, I modified the test's reporting to showing min and max values, which is what is shown in the second and third time fields above.
USE_STATIC_TLS is a developer option to allow an NSPR developer to experiment with different implementation strategies. Mozilla should not build NSPR with this option. If you want static TLS, we can modify the "WIN95" version of NSPR to detect at run time whether it is safe to use static TLS. We already do this in the "WINNT" version of NSPR.
Here is the dynamic solution almost copied straight from the NT code. Now thread local storage will be used unless nspr.dll is loaded dynamically with an explicit LoadLibrary call. As mentioned earlier this makes locking with PR_Lock ~25% faster on Windows 2000 and PR_Monitor ~50% faster. This should affect xpconnect code the most since it uses locks all the time. The trivial getElementById function was ~3-4% faster by using thread local storage instead of making system calls for locking. I would very much like to have this in Mozilla 1.0 since it's such a trivial change (copied from existing, proven code) with measurable effect.
wtc, can you review this? I don't know the procedure to get patches into NSPR so if I'm doing it the wrong way please tell me.
Keywords: mozilla1.0, patch
Summary: Monitors and locks can on Windows can be substantially faster → Monitors and locks on Windows use unnecessary system calls
Whiteboard: [HAS FIX]
Priority: -- → P1
Target Milestone: --- → 4.2
Blocks: 133659
Comment on attachment 76047 [details] [diff] [review] Patch to switch strategy dynamically This patch is correct except for one bug in w95dllmain.c. >@@ -53,6 +55,15 @@ > case DLL_PROCESS_ATTACH: > break; > case DLL_THREAD_ATTACH: >+ /* >+ * If lpvReserved is NULL, we are dynamically loaded >+ * and therefore can't use static thread-local storage. >+ */ >+ if (lpvReserved == NULL) { >+ _pr_use_static_tls = FALSE; >+ } else { >+ _pr_use_static_tls = TRUE; >+ } > break; > case DLL_THREAD_DETACH: > if (_pr_initialized) { The new code should be added under the DLL_PROCESS_ATTACH case. I will attach a new patch.
Attachment #76047 - Flags: needs-work+
I fixed the minor bug in DllMain. I also adjusted the comments and white space to match the "WINNT" version more closely.
Attachment #76047 - Attachment is obsolete: true
Patch checked into the tip of NSPR. I will check it in on the NSPRPUB_PRE_4_2_CLIENT_BRANCH as well. Marked the bug fixed.
Status: NEW → RESOLVED
Closed: 23 years ago
Resolution: --- → FIXED
Part of the "minor downside" of using static TLS is bug 135570 -- inability to install on all Win95 and scattered Win98 systems.
So how do we fix bug 135570? Give up on the perf wins and back this out, or ship an extra copy of nspr4.dll along with the installer for use on Win95 systems?
I *really* doubt that we will trade a 2% speed win for a download size win.
It's not just the size, someone's got to figure out how to coerce the nspr build system to build two versions of nspr4.dll, and then change the install code to figure out how to use them both. Not rocket science, but not so easy it'll get done by tomorrow's deadline as compared to backing this out so we can ship a beta on Win95.
In light of Microsoft Knowledge Base article Q118816 (http://support.microsoft.com/default.aspx?scid=kb;EN-US;q118816), the patch should be backed out and this bug should be resolved WONTFIX as long as we need to support Windows 95.
Status: RESOLVED → REOPENED
Resolution: FIXED → ---
the pain! can we think of any other way to support win95 while maintaining the performance improvements that this fix gains us?
dougt: http://bugzilla.mozilla.org/show_bug.cgi?id=135570#c68 If I understand the problem correctly as long as the exe in question links to nspr4.dll then the TLS will be happy - even if a later attempt is made to load nspr4.dll (directly or indirectly) via LoadLibrary. Whether or not this is an acceptable solution is debatable.
Doug, John, NSPR needs to support more than the Mozilla client and its installer. This patch breaks any NSPR client that loads nspr4.dll with a LoadLibrary call on Windows 95. As long as NSPR needs to support Windows 95, this is not acceptable.
If we back out the patch it looks like this is controlled by a compiler flag. We could tickle the build scripts to produce two versions of the nspr library on windows. Mozilla could use the faster version, and the original version could be used by the installer and any other embedding client that dynamically loads the library. Once we create the Mozilla Runtime Environment that'll mean all embeddors, right?. Apps will look up the centrally-installed correct version of the MRE in the registry and use LoadLibrary() to get it going. The first step is to back out this patch and send the performance gurus back to the drawing board on how to get the faster second version out of the build system.
FWIW sr=jband on backing out the change.
a=dbaron on backing out on 1.0 branch
Blocks: 135570
The fix has been backed out of the tip, NSPRPUB_PRE_4_2_CLIENT_BRANCH, and MOZILLA_1_0_0_BRANCH of NSPR. Marked the bug WONTFIX. The fix can be checked back in when we don't need to support Windows 95.
Status: REOPENED → RESOLVED
Closed: 23 years ago23 years ago
Resolution: --- → WONTFIX
Reopening. The issue is still there. Now that we know that the way recommended in comment 18 breaks Win95, I would like to try the "Two types of NSPR"-way with the USE_STATIC_TLS compiler flag as was dicussed at the start. This really affects lock heavy code. And even if we don't want to change that, there's still the issue of PR_{Enter,Exit}Monitor calling TlsGet twice to get the same value. Once in its own function and once in PR_Lock.
Status: RESOLVED → REOPENED
Resolution: WONTFIX → ---
QA Contact: wtchang → nspr
The target milestone is already released. Resetting target milestone.
Target Milestone: 4.2 → ---
(In reply to comment #35) > The fix has been backed out of the tip, NSPRPUB_PRE_4_2_CLIENT_BRANCH, > and MOZILLA_1_0_0_BRANCH of NSPR. Marked the bug WONTFIX. > > The fix can be checked back in when we don't need to support Windows 95. now that win9x is dead...
Status: REOPENED → NEW

The bug assignee didn't login in Bugzilla in the last 7 months.
:KaiE, could you have a look please?
For more information, please visit auto_nag documentation.

Assignee: wtc → nobody
Flags: needinfo?(kaie)
Status: NEW → RESOLVED
Closed: 23 years ago3 years ago
Flags: needinfo?(kaie)
Resolution: --- → WONTFIX
You need to log in before you can comment on or make changes to this bug.

Attachment

General

Creator:
Created:
Updated:
Size: