Closed Bug 120992 Opened 24 years ago Closed 23 years ago

Decimal to string conversion very slow because of locking

Categories

(Core :: JavaScript Engine, defect)

x86
Windows 2000
defect
Not set
normal

Tracking

()

VERIFIED FIXED
mozilla1.1alpha

People

(Reporter: bratell, Assigned: bratell)

References

Details

(Keywords: js1.5, perf)

Attachments

(4 files, 10 obsolete files)

516 bytes, text/plain
Details
25.51 KB, image/png
Details
6.25 KB, patch
khanson
: review+
brendan
: superreview+
Details | Diff | Splinter Review
1.23 KB, text/html
Details
js_dtoa is very slow compared to how fast it should be. It's for instance 250 times slower than converting an integer to a string. That's all because of locking. For some reason js_dtoa help functions diff, lshift, i2b, d2b and mult all call Balloc to get a buffer which ends up doing locking with PR_Lock and PR_Unlock. For 10000 calls to js_dtoa I see 235 000 calls to Balloc. Balloc and Bfree is about 75% of the function. The rest is divided between quorem, multadd, diff and the function JS_dtoa iself. I tried to look at the function but it looks to pure greek to me. In fact I think greek is somewhat easier unless you were the one to write the function.
This patch passes around a stack based allocator that in the normal cases I've tested means that we never use the heap or the shared buffer. This saves 70% or more of the function time (that is 200ms on the sort test case page). It seems to work, so I could use some reviews. The patch is a couple of hundred lines long, but most of the lines are just changed to pass around a context (ctx) that's used when allocating memory. Brendan?
Assigning to myself since I did a patch after all.
Assignee: rogerl → bratell
Keywords: patch, perf
Blocks: 117611
Yeah, the David M. Gay (based on Guy Steele) dtoa code is crufty, and the way thread-safety was hacked in without undoing the needless contention through Balloc, is all wrong. Thanks for the patch. Kenton, could you review first and comment? I'm tied up elsewhere. /be
Keywords: js1.5, mozilla0.9.9
Just don't review this just yet. I've found a problem with the p5s routine that keeps Bigints on a static list. It used to work fine, but not anymore when the Bigints are coming from the stack. I didn't see that before because apparently the stack buffer was so small that all bigints in p5s happened to be allocated on the heap. I then tuned the stack size and now it misbehaves. Patch coming up tonight.
Attached patch New better patch. (obsolete) — Splinter Review
This is the promised patch. dtoa is now 2.7 times faster than before in my tests. Since the previous patch I've tuned the constants so that the heap is very seldom needed, after all, almost all decimals sent in is small. Also, the Bigints are moved to the heap before being put on the shared list so that they can be reused. In 20000 calls I see locking twice and heap allocation once. The rest of the function time is spent in maths that is better left to people who understands it. khanson, can you review this?
Attachment #65835 - Attachment is obsolete: true
Comment on attachment 65894 [details] [diff] [review] New better patch. Yes I will review this patch. I also have been thinking about ways to optimize dtoa =Kenton=
@@ -702,10 +805,10 @@ } PR_Unlock(p5s_lock); if (wasted_effort) { - Bfree(wasted_effort); + fastBfree(ctx, wasted_effort); } #else - p51 = p5->next = mult(p5,p5); + p51 = p5->next = moveToSharedMemory(mult(ctx, p5,p5)); should probably be + p51 = p5->next = moveToSharedMemory(ctx, mult(ctx, p5,p5)); I am having trouble. I enter to the console "x = 3.3e199;" then when I enter "x" to the console I get an "unmapped memory exception." FastBalloc is producing an unintialized value in bi->k, leading me to believe that a pointer is invalid. I am using Codewarrior on a Mac. =Kenton=
Yes, the ctx should be there. I guess there must be some bug when building without JS_THREADSAFE. How do I build and run the console (I presume you don't meen the JavaScript Console in Mozilla)? I tried 3.3e199 in the javascript console in the tools menu in Mozilla and got "3.3e+199" back. The bi->k returned from fastBalloc should have been set in initFastDToAContext so a damaged pointer sounds like a plausible explanation. I will try to find it as soon as I figure out how to build single threaded.
Attached patch With one more stack->heap move (obsolete) — Splinter Review
I found one bug in the not threadsafe code. 625 wasn't moved to the heap before being put on the shared powers-of-five-list. That would probably cause a crash. khanson, what do you think now?
Attachment #65894 - Attachment is obsolete: true
*** Bug 99120 has been marked as a duplicate of this bug. ***
99120 not a duplicate of this bug. I hit the wrong button!
Daniel, I agree with your analysis. The last patch works fine. I want to do a some more testing and will r= it soon. I was a little concerned with the prefix "fast". As the code ages hopefully the improvements will be viewed as normal. Thanks for the patch, =Kenton=
I'm more than willing to change that name. It started as a working name while hacking and then I never changed it. It's not guaranteed to be "fast" so fast is not a very good prefix anyway. I'll change it to something neutral before checking in (Balloc? LocalBalloc?).
Local sounds good to me -- my 2 cents. Or Auto if that's the storage class (cf nsAutoString in mozilla/string code). /be
Attached patch fast -> local naming change (obsolete) — Splinter Review
New patch, but the only thing that has changed is that "fast" is replaced by "local". I beg for more reviews. :-)
Attachment #67224 - Attachment is obsolete: true
Comment on attachment 67480 [details] [diff] [review] fast -> local naming change r=khanson Great patch! I am wondering if keeping all the powers of 5 less than say 5^50, would improve performance. These get rebuilt over and over during binary <-> decimal conversions. Daniel, you might consider reviewing bug #85267. Thanks Kenton
Attachment #67480 - Flags: review+
Just need sr= to get this patch in -
Brendan, does this deserve your sr=
Blocks: 85267
Looking for sr= on this patch
Comment on attachment 67480 [details] [diff] [review] fast -> local naming change Patch needs a minor update to reflect changes made by bug#130711
Attachment #67480 - Flags: review+ → needs-work+
Attached patch Merged with latest on the trunk (obsolete) — Splinter Review
Oh, I thought I had already uploaded this patch. I've had the patch with the bug 130711 included in this fix all the time. So, no changes, just merged with the trunk. New r and sr, khanson, brendan, whoever feels competent?
Attachment #67480 - Attachment is obsolete: true
Daniel, I'm not seeing much of a performance improvement with this patch. Granted, I may be looking at a biased benchmark. Could you please provide a test program that exhibits the performance increase referenced in comment #5. Thanks, Kenton
The testcase I think I used was the array sorting test on <http://www.formula-one.nu/mozilla/jstestcases/sortArray.htm>. I have no clean build to compare with so I can't do the test now. Also, I see that you're test is full of integers. Notice that this code is for decimal (floating point) number conversion to string. It might be so that your test don't exercise this code at all. Filling an array with fibonacci numbers which are very integery by nature will probably use int->string conversion.
You might also find a test case in the long javascript URL at http://bugzilla.mozilla.org/showattachment.cgi?attach_id=72417 I'd be interested to know if this patch affects performance in that test. On that test, NN is an order of magnitude faster than Mozilla. For details, see http://bugzilla.mozilla.org/show_bug.cgi?id=70054#c5 -matt
Daniel, The fibonacci series I used produces mostly numbers that can only be represented in floating-point (greatly exceeding the C int range.) My test program should be a good test for binary<->decimal conversions and should exercise your patch. I worked with the test program you referenced and did not see any appreciable improvement. I might be testing in the wrong environment or not testing the correct profile of numbers. I would really like to reproduce the 2.7 factor of improvement you achieved in jsdtoa.c (comment #5.) I?m open to suggestions. =Kenton=
Changed the powers of 5 routine to be table driven for all powers less than or equal to 32. Then tried all powers of 5 less than or equal to 64. I found no appreciable performance improvement.
I'm a little curious why you don't see anything with this patch. I did both profiles and wall time measurement when I wrote it and both showed an improvement. Now I have no build to compare with so I can not repeat my tests. It may be that your tests have something else taking so much time that the conversion isn't measurable, or that the numbers you have in your test for some reason don't trigger the code that used to be very, very slow. I don't know more than that this was a win for the sort test and removed something that shouldn't have been there for such a low level routine (shared buffers requiring locking all the time). I wish I had the time to look into this more thouroghly but I don't have the time right now to set up a testing environment. :-( Could you look at it in a profiler or debugger? Just to see if js_dtoa is called and how often? I wouldn't expect a table for p5s to gain much, but it would probably make the code much easier to understand which is a good thing.
Ok, now I've looked at the profile of the testcase in a profiler. It seems as if it tests the right things quite well. When running everything in the browser, js_dtoa is still 42% of the CPU time. I do see that the test causes localBalloc to revert to heap based allocation in 20% of the allocation cases which is almost half of the time in js_dtoa. I will try to see why. It's probably just some buffer size that has to be tuned. I'll be back.
Attached patch New tuned patch (obsolete) — Splinter Review
Yes, the MAX_STACKBUFFER_EXP had to be increased from 5 to 6. Here is a new patch that is even faster. I made a webbified test from the one Kenton attached (I will attach my test too) and that one became 10% faster with this change. js_dtoa went from 2.2s to 1.5s. Kenton or Brendan, can you check the test I will attach and then review this patch? I really think we're there now. Any more improvements will have to be algorithmic. (multadd is slow, quorem and diff too)
Attachment #75816 - Attachment is obsolete: true
Attached file The previous testcase webified. (obsolete) —
This test runs in 5.5-6.0 seconds on my Duron 800MHz with the newest patch applied.
I get much lower times when running the test from bugzilla.mozilla.org then when I ran it from my own local disk. Strange. The last patch times in at 4.25s. The next to last patch is at 4.55s. I have no clean build to test with but I would guess that a trunk mozilla would be at ~5-6s.
Best to test a .js file via xpcshell, not a .html page via mozilla, I think. /be
Oh, I didn't know about xpcshell. Looks useful. But, anyway, for this I was only interested in an easy testcase in a real environment (multi threaded, GC triggering, external events....) that could show that the patch make a difference, and for that the test suffices.
I tested the latest patch #79060 in the Codewarrior environment on a Macintosh. The patch resulted in a 6% degradation of performance. I will try the VC++ environment on Windows.
How can that be even remotely possible???? The patch adds an argument to the function calls. It adds an initialization that might be ~100 machine code instructions. It uses ~1KB more of the stack. That's the downside. The other side is that it removes doussins of calls to PR_Lock and malloc and free, which are costly both in heap usage and in clock cycles. Comparing the two sides, this should save us immensely more then it costs. I've studied this in Windows profilers. I've checked every line of the patch. I can see nothing that show anything else than a great speedup. In profilers. In real life. In Mozilla. Could you, please, do a profile to see what _really_ goes on? What environment are you testing in? Are you sure it's quiet, no external processes that disturb? Is everything really built optimized?
Attached file Higher lock fix for jsdtoa.c (obsolete) —
Locks are done at a higher level with this new code. All changes are marked by //perf.
Ah, inline. Do you know how much of a difference that made? Btw, jsdtoa.c is a C file. Didn't inline appear first in C99? In that case this might break a few compilers. It should be possible though to make it a #define so that supporting platforms can get it if it looks promising.
The inline helped the OS/2 a great deal. I did not see very much improvement for NT. The OS/2 compiler is really bad. I will try to get my results out on bugzilla.
I ran 2 sets of tests on 2 operating systems (OS/2 and NT 4.0). I ran the test Daniel included in bugzilla and the Parseint test that is part of http://www.formula-one.nu/mozilla/jsTimeTest.htm . I tested using the original nightly builds of 4/15/02. I rebuild the javascript dll using Daniel's stack fix and my higher lock fix (see attachment). The tests which were run on a dual boot 333Mhz Pentium 2 are displayed below: O/2 Tests 4/15 orig 4/15 w/stack fix 4/15 w/lock fix Parseint test 45.8 secs 37.5 secs 32.5 secs Daniels's test 17.6 secs 14.5 secs 15.5 secs NT 4.0 Tests 4/15 orig 4/15 w/stack fix 4/15 w/lock fix IE6.0 Parseint test 37.1 secs 35.3 secs 32.4 secs 18.0 secs Daniels's test 15.4 secs 14.8 secs 14.2 secs 4.2 secs Test machine was a 333 MHz - 128MB, Ethernet,Win NT 4.0 SP4, OS/2 CP4.52 I think the problem with the stack fix may be the extra parm that is passed around. Our traces show that on NT before the fixes were applied about 15% is spent using locks. In OS/2 the number is about 35%. Ivan
I ran 2 sets of tests on 2 operating systems (OS/2 and NT 4.0). I ran the test Daniel included in bugzilla and the Parseint test that is part of http://www.formula-one.nu/mozilla/jsTimeTest.htm . I tested using the original nightly builds of 4/15/02. I rebuild the javascript dll using Daniel's stack fix and my higher lock fix (see attachment). The tests which were run on a dual boot 333Mhz Pentium 2 are displayed below: O/2 Tests 4/15 orig 4/15 w/stack fix 4/15 w/lock fix Parseint test 45.8 secs 37.5 secs 32.5 secs Daniels's test 17.6 secs 15.5 secs* 14.5 secs* NT 4.0 Tests 4/15 orig 4/15 w/stack fix 4/15 w/lock fix IE6.0 Parseint test 37.1 secs 35.3 secs 32.4 secs 18.0 secs Daniels's test 15.4 secs 14.8 secs 14.2 secs 4.2 secs Test machine was a 333 MHz - 128MB, Ethernet,Win NT 4.0 SP4, OS/2 CP4.52 I think the problem with the stack fix may be the extra parm that is passed around. Our traces show that on NT before the fixes were applied about 15% is spent using locks. In OS/2 the number is about 35%. * corrected previous data. Ivan
You guys may want to look at bug 138666. I found another memory leak (very similar to the last one I found), and you may want to incorporate the small change there into your patches here. --scole
Attached image Graph over alternatives
Now, I've done thourogh measurements of the alternatives on Windows 2000. I've done the tests in xpcshell (thank you Brendan!) and I've done tests with the current code in CVS, the stack option and the high level lock option. I've also tested the high level lock option and stack option with and without inlining of free and cmp. The test was Hanson's test but with 50 iterations. I then ran the test 25-30 times and recorded the times as reported by the test. The difference between the options are obvious when graphed. For some reason all test runs follow the same pattern making them extra easy to compare. Neglecting the inline the high level lock is 250ms faster than the original and the stack buffer is 150ms faster. Since js_dtoa is ~50% of the total time, that would account for a speedup of 12% and 8%. The inline of free and cmp then brings another 150ms and that is regardless of locking solution. The tree also contained code from the next version of NSPR which has faster locking on Windows so the improvements in a current tree (with an old NSPR) are probably bigger. All results are also confirming Ivan's numbers so now I think we know where we are. My proposal is therefore to use Ivan's high level locks without inlining and then open a new bug about inlining where we can discuss which functions should be inlined, along with compatibility and footprint issues.
Here is Ivan's patch without inlines and with matching ACQUIRE_LOCK and FREE_LOCK. The locks are placed in the JS_FRIEND_API functions. I've opened bug 138741 about inlining functions. khanson, review? The patch is very short.
Sam discovered that prdtoa.c has the same locking problems as jsdtoa.c . Should another bug be opened for this fix to be applied to prdtoa.c ? Note, the fix would look pretty much the same as for jsdtoa.c .
Better be another bug. prdtoa, I guess that's NSPR, and NSPR have a different release schedule, different module owners and checkin policy.
Comment on attachment 80175 [details] [diff] [review] Updated high level locking. No inline. r=khanson. This looks good to me. Thank you Ivan and Daniel for the fix and anaylsis.
Attachment #80175 - Flags: review+
Comment on attachment 80175 [details] [diff] [review] Updated high level locking. No inline. >Index: jsdtoa.c >=================================================================== >RCS file: /cvsroot/mozilla/js/src/jsdtoa.c,v >retrieving revision 3.16 >diff -u -r3.16 jsdtoa.c >--- jsdtoa.c 19 Mar 2002 04:28:20 -0000 3.16 >+++ jsdtoa.c 20 Apr 2002 08:17:04 -0000 >@@ -331,7 +331,8 @@ > > static Bigint *freelist[Kmax+1]; > >-/* Allocate a Bigint with 2^k words. */ >+/* Allocate a Bigint with 2^k words. >+ * This is not threadsafe. The caller must use thread locks */ Nit: prevailing JS comment style for multiline comments is /* * blah blah * blah. */ But maybe this matches jsdtoa.c's "style" (not coherent enough for me to avoid sarcastic quotes :-). > static Bigint *Balloc(int32 k) > { > int32 x; >@@ -340,10 +341,8 @@ > uint32 len; > #endif > >- ACQUIRE_DTOA_LOCK(0); > if ((rv = freelist[k]) != NULL) > freelist[k] = rv->next; >- FREE_DTOA_LOCK(0); Nit: the rest of the engine uses FREE for memory deallocation and RELEASE as the antonym to ACQUIRE for locking. Another nit: the (n) parameter to these ACQUIRE/FREE_DTOA_LOCK macros seems useless. We own this fork of the old DMG code, why not clean it up? >@@ -1093,6 +1090,8 @@ > #ifdef JS_THREADSAFE > if (!initialized) InitDtoa(); > #endif >+ /* Locking for Balloc's shared buffers */ >+ ACQUIRE_DTOA_LOCK(0); > > *err = 0; > >@@ -1679,6 +1678,7 @@ > ret: > if (se) > *se = (char *)s; >+ FREE_DTOA_LOCK(0); > return sign ? -rv : rv; > } > Good to bracket the whole routine and consolidate returns (as was already done), but why lock longer than needed? Couldn't you ACQUIRE just after the break2: label, and RELEASE before setting *se conditionally -- that is, immediately after the ret: label? >@@ -2496,14 +2496,19 @@ > /* the sign and/or decimal point */ > char *numEnd; /* Pointer past the digits returned by JS_dtoa */ > >+ /* Locking for Balloc's shared buffers */ >+ ACQUIRE_DTOA_LOCK(0); >+ > JS_ASSERT(bufferSize >= (size_t)(mode <= DTOSTR_STANDARD_EXPONENTIAL ? DTOSTR_STANDARD_BUFFER_SIZE : > DTOSTR_VARIABLE_BUFFER_SIZE(precision))); > > if (mode == DTOSTR_FIXED && (d >= 1e21 || d <= -1e21)) > mode = DTOSTR_STANDARD; /* Change mode here rather than below because the buffer may not be large enough to hold a large integer. */ > >- if (!JS_dtoa(d, dtoaModes[mode], mode >= DTOSTR_FIXED, precision, &decPt, &sign, &numEnd, numBegin, bufferSize-2)) >+ if (!JS_dtoa(d, dtoaModes[mode], mode >= DTOSTR_FIXED, precision, &decPt, &sign, &numEnd, numBegin, bufferSize-2)) { >+ FREE_DTOA_LOCK(0); > return 0; >+ } > > nDigits = numEnd - numBegin; > Same comment here: why are we holding the lock across all the string fiddling that doesn't need it? >@@ -2598,6 +2603,8 @@ > (word1(d) || (word0(d) & Frac_mask)))) { > *--numBegin = '-'; > } >+ >+ FREE_DTOA_LOCK(0); > return numBegin; > } > IOW, lock around the JS_dtoa call only. BTW, that function is static, and callers must be in jsdtoa.c because they must lock around calls to it, so it should be named js_dtoa (JS_ is for public APIs). >@@ -2657,6 +2664,9 @@ > double di; /* d truncated to an integer */ > double df; /* The fractional part of d */ > >+ /* Locking for Balloc's shared buffers */ >+ ACQUIRE_DTOA_LOCK(0); >+ > JS_ASSERT(base >= 2 && base <= 36); > > buffer = (char*) malloc(DTOBASESTR_BUFFER_SIZE); Same deal here: do not lock around a malloc call, it increases contention and risks deadlock. >@@ -2674,6 +2684,7 @@ > /* Check for Infinity and NaN */ > if ((word0(d) & Exp_mask) == Exp_mask) { > strcpy(p, !word1(d) && !(word0(d) & Frac_mask) ? "Infinity" : "NaN"); >+ FREE_DTOA_LOCK(0); > return buffer; > } > >@@ -2813,5 +2824,7 @@ > JS_ASSERT(p < buffer + DTOBASESTR_BUFFER_SIZE); > *p = '\0'; > } >+ >+ FREE_DTOA_LOCK(0); > return buffer; > } Thanks for the great work here, I hope you can fix the over-long critical section problems above, and pick the nits. I'll gladly sr= a patch that addresses those issues. /be
This should address all your points I think. The locks are held for shorter times and the function is renamed to js_dtoa (lowercase js).
Attachment #79060 - Attachment is obsolete: true
Attachment #79867 - Attachment is obsolete: true
Attachment #80175 - Attachment is obsolete: true
Urk. Would it be too much to ask that ACQUIRE_DTOA_LOCK be changed to ACQUIRE_DTOA_LOCK() so that it still looks like a function? (And the RELEASE too, naturally...) Without the parens it looks too much like a constant macro, not like a code macro... --scole
Comment on attachment 83926 [details] [diff] [review] Patch addressing Brendan's comments Yeah, what scole said. Make macros that expand to non-constant expressions look like function calls by taking 0 params if necessary. /be
With macros looking like macros (my fault; I misunderstood) and some comments updated.
Attachment #83926 - Attachment is obsolete: true
Comment on attachment 84018 [details] [diff] [review] The New patch which will bring happiness to all the world. r=khanson. Looks good. Did some testing and saw a 1.3% increase in speed (similar to the results in attachment #80173 [details]).
Attachment #84018 - Flags: review+
Comment on attachment 84018 [details] [diff] [review] The New patch which will bring happiness to all the world. sr=brendan@mozilla.org, thanks! Let's get this into the 1.1alpha trunk. /be
Attachment #84018 - Flags: superreview+
Fix is checked in. Anyone interested in the inline issue should head over to bug 138741.
Status: NEW → RESOLVED
Closed: 23 years ago
Resolution: --- → FIXED
Target Milestone: --- → mozilla1.1alpha
Daniel - see bug 145493
I just made a small adjustment to the checked in patch. Bad indentation (no indentation at all) made me miss that one of the ACQUIRE_JSDTOA_LOCK was inside a for loop that sometimes wasn't entered. It asserted badly but I hadn't run a debug build since doing the changes wanted by reviewers so I didn't see it until now. Bad me. I got mozbot to do a quick review. diff -u -r3.18 jsdtoa.c --- jsdtoa.c 18 May 2002 06:21:49 -0000 3.18 +++ jsdtoa.c 18 May 2002 17:29:00 -0000 @@ -1097,6 +1097,10 @@ bb = bd = bs = delta = NULL; sign = nz0 = nz = 0; rv = 0.; + + /* Locking for Balloc's shared buffers that will be used in this block */ + ACQUIRE_DTOA_LOCK(); + for(s = s00;;s++) switch(*s) { case '-': sign = 1; @@ -1119,8 +1123,6 @@ goto break2; } break2: - /* Locking for Balloc's shared buffers that will be used in this block */ - ACQUIRE_DTOA_LOCK(); if (*s == '0') { nz0 = 1;
The html testcase is messed up for this bug. It has a call to print(c) which pops up the print dialog in a nearly endless loop. Someone is a bit too used to writing C, me thinks. It should be document.write(c)...however, when I run the testcase in Mozilla, the last document.write statement never gets printed and the browser just keeps running the testcase. IE finishes up just fine. Attaching testcase in a minute.... Jake
This is an html4.01 Strict valid testcase + in uses document.write(c) instead of print(c) so that the print dialogue won't be popped up in a nearly endless loop.
Attachment #79063 - Attachment is obsolete: true
Marking Verified Fixed. Here are some of my results (all times in ms). WinNT 4.0 (SP6); 500MHz CPU; 128M RAM Kenton's test from Comment #22 run directly in the JS shell: RC4 (Jan 01) RC4a (Mar 21) Current tip (May 20) OPTIMIZED JS SHELL 1360 1235 1260 DEBUG JS SHELL 2688 2625 2641 The parseInt() test from http://www.formula-one.nu/mozilla/jsTimeTest.htm Using Mozilla trunk binaries: 2002-05-14 2002-05-20 27296 23797 26906 23782 27109 23828 The JS shells I used are separated by months, and span fixes for other bugs as well as this one. That may explain the slight decrease in performance on Kenton's test between RC4a and the current tip. The Mozilla builds come from just a couple days before and after this fix, and show a clear improvement -
Status: RESOLVED → VERIFIED
Flags: testcase?
Flags: testcase? → testcase-
You need to log in before you can comment on or make changes to this bug.

Attachment

General

Creator:
Created:
Updated:
Size: