Closed Bug 157334 Opened 22 years ago Closed 14 years ago

IE6 3x faster than Mozilla, concatenating two local strings

Categories

(Core :: JavaScript Engine, defect)

defect
Not set
normal

Tracking


RESOLVED FIXED

People

(Reporter: pschwartau, Unassigned)

References

Details

(Keywords: perf, testcase, Whiteboard: [For best timing results, save and run testcases locally])

Attachments

(8 files, 12 obsolete files)

6.88 KB, text/html
Details
3.40 KB, text/plain
Details
769 bytes, patch
Details | Diff | Splinter Review
2.98 KB, text/plain
Details
153.93 KB, patch
Details | Diff | Splinter Review
126.25 KB, patch
Details | Diff | Splinter Review
3.60 KB, patch
Details | Diff | Splinter Review
3.42 KB, patch
Details | Diff | Splinter Review
This bug arose from the JS Engine performance tracking bug 117611.

It was noted that Mozilla under-performs on Peter Nilsson's string
concatenation test at http://www.formula-one.nu/mozilla/jsTestcases/,
which is structured schematically as follows: 


                  function test()
                  {
                    var str1 = "a very long string";
                    var str2 = "a very long string";
                    var str;

                    for (var i = 0; i <= UBOUND; i++)
                      str = str1 + str2;
                  }
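For reference, the schematic above can be fleshed out into a self-contained harness like the following (my own sketch: the string contents, the UBOUND value, and the output call are placeholders, not the original testcase):

```javascript
// Standalone sketch of the benchmark. UBOUND is global here, matching the
// schematic above; the strings and iteration count are placeholders.
var UBOUND = 10000;

function test() {
  var str1 = "a very long string used for the concatenation benchmark ";
  var str2 = "another very long string used for the same benchmark run ";
  var str;

  for (var i = 0; i <= UBOUND; i++)
    str = str1 + str2;

  return str;
}

// `print` exists in the JS shell; fall back to console.log elsewhere.
var log = typeof print === "function" ? print : console.log;

var start = new Date();
var result = test();
var elapsed = new Date() - start;
log("time (ms): " + elapsed);
```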


I will attach this testcase below in an HTML format that can be run
in NN4.x, Mozilla, and in IE6. I will also attach a standalone JS
version that I run via the load() function in the optimized JS shell.

The key thing to note is that all the variables are declared locally
in the function. This should avoid some DOM overhead, since the DOM
does security checks on all global variable access.

Here are results I got by trying the test in all three browsers
and in the JS shell. The Mozilla build date I used was 2002-07-01.

OS=WinNT4.0(SP6); 500MHz CPU; 128M RAM

UBOUND = 10,000 
Average time after 10 runs (in ms):

IE6  =  314
JS   =  715 
Moz  = 1045
NN4  = 1853


Note: for comparison, here are timing results on an empty loop:

UBOUND = 100,000: 
Average time after 10 runs (in ms):

IE6 =  151
JS  =   97
Moz =  167
NN4 = 2171


We see from above that IE6 is more than 3x faster than Mozilla on the
string concatenation test, and more than 2x faster than the JS shell -
Attached file HTML testcase (poorly scoped) (obsolete) —
Attached file JS shell testcase (poorly scoped) (obsolete) —
Blocks: 117611
Keywords: perf
I'm seeing even worse results.  I did 10 iterations and got:

IE6 =  40.1
Moz = 490.4

I then ran it out to 70 iterations and got:

IE6 =  40.02
Moz = 500.6

Evaluations in IE ranged from 30-50 ms.
Evaluations in Moz ranged from 440-620 ms.

On Win2k (SP2), 1200 MHz CPU, 512M RAM with Mozilla 0704.
Moz 2002071208: Average time (in ms) = 1048.3
Netscape 4.79: Average time (in ms) = 318.6
IE 6:           Average time (in ms) = 58.2
Jerry: can you give us the OS, CPU, and RAM of your PC? 
I'm surprised that NN4.79 is faster than Mozilla for you on this.

What results do you get if you save the testcase locally and run it
off your disk?  That's the best way (or were you doing that already?).

Also, do you have anything set in Mozilla's Edit > Preferences ?
Thanks -
Whiteboard: [For best timing results, save and run testcases locally]
Local files:
Mozilla: Average time (in ms) = 481.47
NN4:     Average time (in ms) = 411.80
IE6:     Average time (in ms) = 58.61

Windows XP; Athlon 800, 512MB PC-133

I have some prefs to keep windows from popping up, but that's the only prefs I
have set w/regards to JS.
Phil, Jerry: please try the test in xpcshell and report results.

/be
Trying the above JS shell testcase in my xpcshell.
OS=WinNT4.0(SP6); 500MHz CPU; 128M RAM; Mozilla 20020701xx

JS shell: Average time (in ms) = 710
xpcshell: Average time (in ms) = 660
Mozilla:  Average time (in ms) = 1018

Most contributors will not have an optimized JS shell handy, 
but Mozilla automatically comes with the xpcshell.exe. 

STEPS TO TEST IN XPCSHELL:
1. Save the JS shell testcase locally as, say, Concat_Two_Strings.js
2. Save it in the same directory that your Mozilla binary is in.
3. Open a console window; for example MS-DOS or CygWin
4. cd to your Mozilla directory
5. At the command prompt, type  ./xpcshell.exe -f Concat_Two_Strings.js
6. (RESULTS WILL PRINT OUT)
7. Hit the up arrow and <Enter> to try again, etc.
8. End your xpcshell session by typing the command quit()
XPCShell:

Average time (in ms) = 260.4
Doing 100 runs:

xpcshell = 204.3

Evaluations in xpcshell ranged from 190-211 ms.
> The key thing to note is that all the variables are declared locally
> in the function. This should avoid some DOM overhead, since the DOM
> does security checks on all global variable access.


True, but in my adaptation of Peter Nilsson's test, I mistakenly
used a global variable |UBOUND| in the for-statement of the loop:


                  function test()
                  {
                    var str1 = "a very long string";
                    var str2 = "a very long string";
                    var str;

                    for (var i = 0; i <= UBOUND; i++)
                      str = str1 + str2;
                  }



I will attach corrections below, and leave the two testcases
above in case anyone wants to compare how much DOM overhead
is incurred by a global |UBOUND|.
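For comparison, the corrected shape with |UBOUND| declared locally (again a sketch, with placeholder strings and a placeholder bound) is:

```javascript
// Corrected version: UBOUND is a local variable, so the loop condition
// never touches a global (and thus never incurs the DOM's security check
// on global variable access when run in the browser).
function test() {
  var UBOUND = 10000;
  var str1 = "a very long string";
  var str2 = "a very long string";
  var str;

  for (var i = 0; i <= UBOUND; i++)
    str = str1 + str2;

  return str;
}
```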
Attachment #91234 - Attachment description: HTML testcase → HTML testcase (poorly scoped)
Attachment #91234 - Attachment is obsolete: true
Attachment #91235 - Attachment description: JS shell testcase → JS shell testcase (poorly scoped)
Attachment #91235 - Attachment is obsolete: true
It was noted in bug 117611 comment 63 that Mozilla appears to have
slowed down on the browser version of the string concatenation test,
when measured from the 1.0 release to today's current trunk.

Brendan asked for pure JS engine testing on this, since the browser
incorporates other modules which could be causing the slowdown.

I did this using the pure JS shell testcase in Comment #13 above.
I compared today's current trunk vs. the CVS tag MOZILLA_1_0_RELEASE.

As timeless notes in bug 117611 comment 65, to build via Makefile.ref
it is necessary to copy over such subdirectories as js/src/editline/
and js/src/config/, since they didn't get the MOZILLA_1_0_RELEASE tag.

Here are my results on WinNT4.0(SP6), 500MHz CPU, 128M RAM.
All times in ms:
                       
                   MOZILLA_1_0_RELEASE   Current trunk (2003-02-17)
UBOUND = 10000            690                 681                
UBOUND = 100000           6917                6837

These show a nominal JS Engine performance improvement on this test.
So it looks like any observed slowdown on this is browser-related -
OS: Windows NT → All
Hardware: PC → All
Do we have any chance of finding out where in the browser this slowdown is 
based on?
Keywords: testcase
Athlon XP 1800+

Firefox 0.9:
Average time (in ms) = 896.93
Excluding first run  = 891.78

IE 6:

Average time (in ms) = 31.06
Excluding first run  = 29.92

Is there any chance to find a reason?
Uh, that's way off, and I wonder whether you'd find a similar difference if you
got rid of the string concatenation and just ran a simple loop test.

Cc'ing jst, peterv, nallen.

Obviously this could be profiled.  jprof or another such sampler, or quantify or
another such instrumenter.  Cc'ing bz in case he's back and interested.

/be
Assignee: khanson → general
Component: JavaScript Engine → DOM: Level 0
There is also a big difference between Linux and Windows here.
Fx on Linux gave results of ~140 ms (4x slower than IE), while Fx on Windows gave
me results between 600 and 800 ms (20x slower).
I tried poking into this by looking at the difference in performance between
browser and xpcshell.  I had some trouble profiling because the browser seems to
have two different performance modes.  mode 1 was the baseline (280 ms), mode 2
took ~20% less time (220 ms), xpcshell took ~40% less time (160 ms), and IE took
~80% less time (60 ms).  Much of the test case time was spent in GC so I focused
on that.

When doing concat, there's a memory allocation failure path:

js_ConcatStrings:182
js_NewString:2439
js_AllocGCThing:529
js_GC

which is the last ditch allocator to try again.  xpcshell ran GC through that
path quite a bit taking relatively short GCs.  mode 2 ran GC through that path
an order of magnitude less frequently taking fewer, longer GCs spending a lot of
time in js_MarkGCThing.  mode 1 almost never ran GC through that path.  It was
instead taking the JS_MaybeGC on the dom branch callback and running many, long GCs.

I can't duplicate the extremely bad performance numbers.  This needs to be
profiled on one of the machines that sees 25x+ slowdown compared to IE.  But I
wouldn't be surprised if the % time in GC was huge.
Oh, these last ditch GCs are coming from
http://lxr.mozilla.org/mozilla/search?string=gGCSize.  Nallen, can you try
setting that constant to 0xffffffff and rerunning?

Also, benchmarking the js shell, built from js/src/Makefile.ref on Unixes with a
few directories pulled (config, editline) in addition to what Mozilla builds
pull, gives the baseline for JS perf.  XPCShell has thread-safety enabled,
although its costs have been optimized away in most cases.

/be
I see no point, given the DOM's use of JS_GC and JS_MaybeGC, in having any
last-ditch GC limit for XPConnect-based apps that we host.  Other apps lacking
our DOM code, but using SpiderMonkey and XPConnect, may care; they will have to
make themselves heard, or suit themselves with better GC-calling from their own
code.

/be
Putting this on aviary1.0 radar.

/be
Flags: blocking-aviary1.0+
I spun a build with unlimited gGCSize.  FF dropped down to 170 ms on the test
case, competitive with where xpcshell was.  xpcshell is toast though.  It
dropped from 160 ms to 130 ms but it can't keep that up very long since it never
runs a GC ever.
Is there any chance to get Win build with this patch ? I don't have any
environment to build it by myself.
I wouldn't recommend running with this patch.  It speeds up the testcase, but
bumping the GC limit to a larger yet still reasonable value would as well.  When
things go wrong with no limiter, they can go really wrong.  Multiple lines of
defense are a good thing here.
Flags: blocking1.8a4?
Flags: blocking1.7.x?
Flags: blocking1.8a4? → blocking1.8a4-
Taking -- I have some JS-engine-focused inspiration based on some recent work
done for hardware-optimized Java VMs.

/be
Assignee: general → brendan
Component: DOM: Level 0 → JavaScript Engine
Priority: -- → P1
Target Milestone: --- → mozilla1.8beta
Attached file design notes (obsolete) —
"Stack" here refers to a virtual interpreter LIFO storage structure, not even
the interpreter's stack, and not a machine (thread or process) stack.

/be
Attached file design notes, v2 (obsolete) —
Attachment #159677 - Attachment is obsolete: true
Attached file design notes, v3
Attachment #159693 - Attachment is obsolete: true
Flags: blocking1.7.x? → blocking1.7.x-
The bad behavior for synthetic benchmarks that overallocate memory is no reason
for blocking aviary1.0.  I'll fix this soon enough on the trunk, we'll get it
into a shipping release soon enough.

/be
Status: NEW → ASSIGNED
Flags: blocking-aviary1.0+ → blocking-aviary1.0-
Target Milestone: mozilla1.8beta1 → mozilla1.9alpha
Blocks: 203448
Are we there yet? (soon enough, that is)

;-)
Tested with Firefox 20050621

FF -> Average time (Ubound = 100000): 2650
IE6 -> Average time (Ubound = 100000): 311

So, nope, no improvement in almost 3 years.
Well... On my Athlon 1.8, Gentoo with Gecko/20050605 Firefox/1.0+ it's 

(100 tries, default UBOUND)

Average time (in ms) = 198

Attachment 96729 [details]. 

Opera 8.01:

Average time (in ms) = 256

Konq 3.4:

Average time (in ms) = 158.62
One of you commenters want to chip in more than complaints?  The numbers for
other browsers are welcome, but it would be good to know the machine
configuration(s) used for these tests.

What's more, try varying UBOUND from small to large and plot the results (attach
csv or other data files, something we can chart with spreadsheets).  That would
be a help.

My past comments reflected interest, but I didn't have the time.  This bug is
now targeted for 1.9alpha, which is right.  No more complaining -- supply
signal, not noise.

/be
If I were to answer this the way most people are answered by owners when they
make a request in Bugzilla, I would say, "Hey, it's open source. If you want the
data, you are welcome to get it yourself."
(In reply to comment #35)
> If I were to answer this the way most people are answered by owners when they
> make a request in Bugzilla, I would say, "Hey, it's open source. If you want the
> data, you are welcome to get it yourself."

Of course, if Brendan were to answer in that fashion, he would say: "Hey, it's
open source. If you want better performance, you are welcome to fix this bug
yourself."
Jerry, take it somewhere else or I'll stop playing along.  The whole point here
is cooperation, not whining or grumbling, even (*especially*) if you have a
point.  No one owes you anything, and you don't owe me more data -- but it would
help.  Are you here to help?  I am, but I'm kinda busy and I bet others could
get the data faster/better and maybe even analyze it well.

/be
Flags: testcase?
Checking in regress-157334-01.js;
/cvsroot/mozilla/js/tests/js1_5/String/regress-157334-01.js,v  <-- 
regress-157334-01.js
initial revision: 1.1
Flags: testcase? → testcase+
Blocks: 164421
(In reply to comment #38)
> Checking in regress-157334-01.js;
> /cvsroot/mozilla/js/tests/js1_5/String/regress-157334-01.js,v  <-- 
> regress-157334-01.js
> initial revision: 1.1
> 

Is that trunk, or branch? I know this is a trunk bug, but the fix was going to be done "soon" in Sept 2004, with a "shipping release" soon thereafter. Just wondering which branch this patch landed on, in case I missed something. I guess it is too late for 2.0 at this point, but one never knows. Maybe it can be slipped in if something else forces another RC, since this is a pre-existing open bug and the fix improves performance by 3x. Thanks.
(In reply to comment #39)

That check-in is for a test, not a fix. This bug is still open, as you can see from its status, ASSIGNED.
QA Contact: pschwartau → general
Attached patch backup (obsolete) — Splinter Review
Work in progress: eliminate GCF_LOCK as a stored flag; move GCF_MUTABLE into the JSString length member.

/be
Target Milestone: mozilla1.9alpha1 → mozilla1.9 M8
Attached patch backup, v2 (obsolete) — Splinter Review
Attachment #275539 - Attachment is obsolete: true
Comment on attachment 275567 [details] [diff] [review]
backup, v2

>Index: js/src/jsgc.h
> /* GC thing type indexes. */
> #define GCX_OBJECT              0               /* JSObject */
> #define GCX_STRING              1               /* JSString */
> #define GCX_DOUBLE              2               /* jsdouble */
>-#define GCX_MUTABLE_STRING      3               /* JSString that's mutable --
>-                                                   single-threaded only! */
>+#define GCX_SCRIPT              3               /* JSScript */
> #define GCX_FUNCTION            4               /* JSFunction */
...
> 
> /* GC flag definitions, must fit in 8 bits (type index goes in the low bits). */
> #define GCF_TYPEMASK    JS_BITMASK(GCX_NTYPES_LOG2)
> #define GCF_MARK        JS_BIT(GCX_NTYPES_LOG2)
> #define GCF_FINAL       JS_BIT(GCX_NTYPES_LOG2 + 1)
> #define GCF_SYSTEM      JS_BIT(GCX_NTYPES_LOG2 + 2)
>-#define GCF_LOCKSHIFT   (GCX_NTYPES_LOG2 + 3)   /* lock bit shift */
>-#define GCF_LOCK        JS_BIT(GCF_LOCKSHIFT)   /* lock request bit in API */
> 
>-/* Pseudo-flag that modifies GCX_STRING to make GCX_MUTABLE_STRING. */
>-#define GCF_MUTABLE     2
>+/* Request flag bits passed to js_NewGCThing but not stored in memory. */

I am curious what is wrong with GCF_LOCK, given that the storage for the flag is available in any case? For strings it saves an entry per hash table, and the patch from bug 386265 comment 14 assumes this efficiency.
I was just backing up patches, also trying to free a flag bit for a purpose I've satisfied otherwise. So, thanks for the comment -- you are right, no need to lose the lock flag, which saves us a double hashtable entry for primitives.

/be
Attached patch backup, v3 (obsolete) — Splinter Review
I restored GCF_LOCK (my baby from 1996) and removed all the mutable-string farbling, so this should not conflict with the patch for bug 391290. I'll review that bug's patch today.

/be
Attachment #275567 - Attachment is obsolete: true
Igor, I'd appreciate your comments and early review. This is working for simple tests, with very promising results (especially in terms of process memory size staying small); I'll pound on it tonight.

The escape barrier does not compact the local heap when a function returns, instead it hopes for good LIFO allocation patterns and return values. I'll instrument that, too, and clean up the ad-hoc instrumentation.

Also TBD: JS_NewContextLocalHeap -- this patch just bloats every context with a local heap.

/be
Attachment #275818 - Attachment is obsolete: true
Attachment #276584 - Flags: review?(igor)
Oops, must remember to upload the right patch.

/be
Attachment #276584 - Attachment is obsolete: true
Attachment #276586 - Flags: review?(igor)
Attachment #276584 - Flags: review?(igor)
In the patch I think a fallible write barrier can be avoided if escaped things are just marked as such. Then, when the local GC detects that the number of escaped things reaches a particular threshold, the whole arena is transferred to the global list.
Transferring the arena on overflow trap to the global heap is a good idea. I did not think of doing that also if there are too many escapees. Will try tomorrow.

/be
Attachment #276586 - Attachment is obsolete: true
Attachment #276606 - Flags: review?(igor)
Attachment #276586 - Flags: review?(igor)
(In reply to comment #49)
> Created an attachment (id=276606) [details]
> important bugfix, plus tracing optimization
> 
> Transferring the arena on overflow trap to the global heap is a good idea. I
> did not think of doing that also if there are too many escapees.

Escape from a frame being popped is not the issue, though -- it's the heap trap that occurs when a local thing (string or number) is stored in a (global, of course) object or root. In such a case, we can try to keep the thing local, but we won't be able to reduce lastLimit to at or below its offset from the first page in the arena.

This will result in more local GCs and freelist-based local GC-thing allocations. It will increase the cost of local GC relative to allocation. That can hurt, at least for the real-nbody.js benchmark (before I added escape trap handling, i.e. fp->lastLimit and js_EscapeBarrier), a lot (order 10x slowdown, I'll post numbers tomorrow).

I'll try it and measure, but I'm worried about pathological performance.

/be
(In reply to comment #50)
> Escape from a frame being popped is not the issue, though -- it's the heap trap
> that occurs when a local thing (string or number) is stored in a (global, of
> course) object or root. In such a case, we can try to keep the thing local, but
> we won't be able to reduce lastLimit to at or below its offset from the first
> page in the arena.

But why use lastLimit and not a flag for local things that became global? If js_GetGCThingFlags is slow for that, then what about using CPU pages for GC arenas, so that accessing the flag field would be simple bit manipulation without any branches? MMgc has page-allocation code that could be used for that.

With the flag, the only consequence of a thing's escape would be a decreased amount of local storage until the local GC decides to transfer the whole arena to the global GC.
  
> This will result in more local GCs and freelist-based local GC-thing
> allocations. It will increase the cost of local GC relative to allocation. That
> can hurt, at least for the real-nbody.js benchmark (before I added escape trap
> handling, i.e. fp->lastLimit and js_EscapeBarrier), a lot (order 10x slowdown,
> I'll post numbers tomorrow).
> 
> I'll try it and measure, but I'm worried about pathological performance.
> 
> /be
> 

(In reply to comment #51)
> (In reply to comment #50)
> > Escape from a frame being popped is not the issue, though -- it's the heap trap
> > that occurs when a local thing (string or number) is stored in a (global, of
> > course) object or root. In such a case, we can try to keep the thing local, but
> > we won't be able to reduce lastLimit to at or below its offset from the first
> > page in the arena.
> 
> But why use lastLimit and not a flag for local things that became global?

That's why I was trying to liberate GCF_LOCK (comment 43 and comment 45). But I would prefer to minimize change so as to get this performance and memory win into 1.9. We can move to mmap based page allocation in a later patch, if there is time.

[Since I wrote that, hours ago] Thanks for filing this now as bug 392263.

> With the flag the only consequence of thing's escape would be decreased amount
> of local storage until local GC decides to transfer the whole arena to the
> global GC.

What if we don't transfer the whole local heap arena, so that some local things are referenced by global things? We will need to unify the global and local mark phases, or the local GC for this local heap's context will sweep the now-globally-referenced, locally-unreferenced thing.

The way local GC works, we don't want global things pointing to local things, or we may end up putting local things on a global or per-thread freelist. The case we hope is much more typical than transferring the whole arena to the global GC is to "pop" LIFO-allocated things and empty the arena, leaving it ready for more local, mostly-LIFO allocations.

/be
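The LIFO allocation pattern being discussed can be caricatured as a bump-pointer heap. This is purely a toy model in JS for illustration; the actual patch manages GC arenas in C, and the names here are made up:

```javascript
// Toy "local heap": allocation is a pointer increment, and popping back to a
// saved mark (as when a function returns) frees everything allocated since,
// with no marking or sweeping needed for the purely-local, LIFO case.
function LocalHeap(size) {
  this.cells = new Array(size);
  this.top = 0;
}

LocalHeap.prototype.alloc = function (value) {
  if (this.top === this.cells.length)
    throw new Error("local heap overflow: fall back to the global heap");
  this.cells[this.top] = value;
  return this.top++;
};

// Save the current allocation point on function entry...
LocalHeap.prototype.mark = function () { return this.top; };

// ...and restore it on function return, "freeing" all temporaries at once.
LocalHeap.prototype.popTo = function (mark) { this.top = mark; };
```

Escapes are what break this picture: a value stored into a global object must survive the pop, which is why the real patch needs an escape barrier at all.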
Question about dependent strings.

When a local dependent string becomes global, the patch AFAICS does not make global its base. In this case what prevents the base from being collected via local GC?

  
(In reply to comment #53)
> Question about dependent strings.
> 
> When a local dependent string becomes global, the patch AFAICS does not make
> global its base. In this case what prevents the base from being collected via
> local GC?

That's a bug, thanks for pointing it out (I had it in the back of my mind, chewing a hole in my brain ;-) -- new patch shortly.

/be
Attached patch dependent string fixes (obsolete) — Splinter Review
Attachment #276606 - Attachment is obsolete: true
Attachment #276850 - Flags: review?(igor)
Attachment #276606 - Flags: review?(igor)
(In reply to comment #55)
> Created an attachment (id=276850) [details]
> dependent string fixes

Is the idea here to allow dependent strings to exist only in the local heap? But then s += string, where s is a global variable, would not be able to benefit from the patch, unless I miss something.
The patch currently does not deal with a GC_GLOBAL_WRITE that fails. How would that be addressed?
(In reply to comment #56)
> (In reply to comment #55)
> > Created an attachment (id=276850) [details] [details]
> > dependent string fixes
> 
> Is the idea here is to allow for dependent strings to exists only in the local
> heap? But then s += string where s is a global variable would not be able to
> benefit from the patch unless I miss something.

Any global write of a local value must global-ize the local. The dependent string case could be handled by extending the tracer to keep track of multiple things to move, and that could be worthwhile. But I would like to instrument more and keep the code simple, and see whether this dependent local string written to global case is indeed a hard case.

Also, the result of + is an independent string, not a dependent string. The left operand (s in your example) would be a global dependent string whose base is the global independent string produced by JSOP_ADD, even though string (the right hand side) is local. See the JOF_GLOBALWRITE testing in jsinterp.c.

With more static analysis, we could do better, but that is for ActionMonkey and for later.

(In reply to comment #57)
> The patch currently does not deal with GC_GLOBAL_WRITE that fails. How that
> would be addressed?

About as badly as the js_MakeStringImmutable failing in js_SetSlotThreadSafe, which you filed as bug 363059 :-P. If there's an OOM error under js_LocalGC called from the write barrier, the value written to the global-heap-contained slot will be null. To fix this I would need to make the *OBJ_SET_* macros fallible.

It could be done, but I'm trying to get benchmark results ignoring this little issue, so I can compare with an alternative patch along the lines you suggested of allowing local things to become reachable from globals, flagging them so they do not get collected by a local GC, and merging global and local mark phases.

/be 
(In reply to comment #58)
> (In reply to comment #57)
> > The patch currently does not deal with GC_GLOBAL_WRITE that fails. How that
> > would be addressed?
> 
> About as badly as the js_MakeStringImmutable failing in js_SetSlotThreadSafe,
> which you filed as bug 363059 :-P.

I do not like the idea of infecting more code with that bug.

> If there's an OOM error under js_LocalGC
> called from the write barrier, the value written to the global-heap-contained
> slot will be null. To fix this I would need to make the *OBJ_SET_* macros
> fallible.

An alternative recovery is to make the local arena global on errors. Then js_LocalGC  never fails.

> 
> It could be done, but I'm trying to get benchmark results ignoring this little
> issue,

What if fixing the issue (which I do not see as a small one) would affect the benchmarks enough to alter the conclusion?
(In reply to comment #59)
> I do not like the idea of infecting more code with that bug.

Agreed, and so I won't commit this patch or any that has the null-on-OOM-from-localGC bug. My review request was more for early eyes on code, not to get the patch landed. I'm still debugging it, in fact.

> > If there's an OOM error under js_LocalGC
> > called from the write barrier, the value written to the global-heap-contained
> > slot will be null. To fix this I would need to make the *OBJ_SET_* macros
> > fallible.
> 
> An alternative recovery is to make the local arena global on errors. Then
> js_LocalGC  never fails.

Not bad for a solution that should not affect benchmarks.

> > It could be done, but I'm trying to get benchmark results ignoring this little
> > issue,
> 
> What if fixing the issue (which I do not see as small one) would affect the
> benchmarks to alter the conclusion?

I think you just solved that problem ;-).

/be
This is merged up to top of trunk as of now.

/be
Attachment #276850 - Attachment is obsolete: true
Attachment #276850 - Flags: review?(igor)
This patch improves performance and memory use for the string concatenation benchmark in this bug, and the double-allocating benchmark in bug 161110. So does the last "plan A" patch (attachment 277024 [details] [diff] [review]), which uses moving GC lightly to preserve LIFO local heap management as much as possible.

But for the real-nbody.js test, which jresig hosts and which originated (AFAIK) in the Debian language shootout, this patch makes things a lot slower. That's because it overcollects compared to the stack-like case in the last patch, which can finalize temporaries without first having to mark, when popping a frame.

I'll continue with "plan A", working to remove fallibility in the write barrier.

/be
(In reply to comment #62)

> But for the real-nbody.js test, which jresig hosts and which originated (AFAIK)
> in the Debian language shootout, this patch makes things a lot slower. That's
> because it overcollects compared to the stack-like case in the last patch,
> which can finalize temporaries without first having to mark, when popping a
> frame.

I suspect that what slowed math-intensive benchmarks with the patch is not "overcollecting" but rather "overscanning". That is, the patch makes LocalGCSweep more expensive, since it now has to scan over escaped things.

This means that a copying collector for doubles would benefit SM even more than just a local heap, since a copying GC never scans the garbage, which calculations with double values produce a lot of. It is too bad that, due to API compatibility, this is not possible for the whole heap and can only be done for a local copy that embeddings do not see.

Still, it should be possible to reduce the scanning overhead within the current API by marking doubles in groups. Then the scanning code can put the whole group on the free list with a single check instead of scanning all group members. With bug 392263 implemented, such grouping should be straightforward to support.
To Brendan:

The comments in this bug mention the real-nbody.js benchmark a few times. Where can I find it? I would like to try it out with the patch from bug 400902.
(In reply to comment #64)
> To Brendan:
> 
> The comments in the bug mention few times real-nbody.js benchmark.

See http://ejohn.org/apps/speed/ under Real-World / N-Body. IIRC this test came from the "great language shootout" benchmark-suite at

http://shootout.alioth.debian.org/

/be
Assignee: brendan → igor
Status: ASSIGNED → NEW
http://ejohn.org/blog/javascript-engine-speeds/ links to various JS benchmarks, including the nbody one.
I am not working on the bug right now.
Assignee: igor → general
> Re: comment 10 -- I commented on that blog. The code shown there has a couple
> of bugs, and we would not want to round up to powers of two past a certain
> point. We often go linear at a threshold: 2, 4, 8, ... 1024, 2048, 3072, 4096,
> ... although of course you can measure O(n^2) compounding of the linear cost
> function if you try hard enough.

re: http://www.yafla.com/dforbes/String_Concatenation_and_Immutable_Strings_Speeding_Spidermonkey#a412 (in case this isn't already discussed herein)

Fixing the bugs in his implementation and switching to linear growth at some small threshold seems reasonable, and maybe this is worth developing just to give it a real try against real-world data.  I still think that even with linear growth at a small threshold, there is a non-trivial footprint cost associated with this idea.
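The growth policy quoted above (doubling up to a threshold, then fixed linear steps) can be sketched as follows. The threshold value and function name are illustrative, not SpiderMonkey's actual code:

```javascript
// Round a requested capacity up: doubling below LINEAR_THRESHOLD, then
// fixed linear steps above it, avoiding exponential waste on huge strings.
// Produces the sequence 2, 4, 8, ..., 1024, 2048, 3072, 4096, ...
var LINEAR_THRESHOLD = 1024; // illustrative; a real engine would tune this

function roundUpCapacity(request) {
  if (request <= LINEAR_THRESHOLD) {
    var cap = 2;
    while (cap < request)
      cap *= 2; // doubling region: 2, 4, 8, ..., 1024
    return cap;
  }
  // linear region: next multiple of LINEAR_THRESHOLD (2048, 3072, ...)
  return Math.ceil(request / LINEAR_THRESHOLD) * LINEAR_THRESHOLD;
}
```

The point of the linear region is that growing a multi-megabyte buffer by a fixed step wastes at most one step's worth of memory, whereas doubling can waste nearly half the allocation.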
Did the patch from comment #21 ever get any traction?
(In reply to comment #68)
> > Re: comment 10 -- I commented on that blog. The code shown there has a couple
> > of bugs, and we would not want to round up to powers of two past a certain
> > point. We often go linear at a threshold: 2, 4, 8, ... 1024, 2048, 3072, 4096,
> > ... although of course you can measure O(n^2) compounding of the linear cost
> > function if you try hard enough.
> 
> re:
> http://www.yafla.com/dforbes/String_Concatenation_and_Immutable_Strings_Speeding_Spidermonkey#a412
> (in case this isn't already discussed herein)
> 
> Fixing the bugs in his implementation and switching to linear-growth at some
> small threshold seem reasonably, and maybe this is worth developing just to
> give it a real try against real-world data.  I still think even with small
> linear growth, at a small threshold, there is a non-trivial footprint cost
> associated with this idea.
> 

How about you work up a patch - we land it, let it go through Talos and a couple of days of nightlies, and then back it out?

Any chance of getting this in, or should we move this to TM?

I can reproduce it in the latest Fx3 nightly - the average is 67, compared to 16 in IE (HTML testcase).
Attached patch hack (obsolete) — Splinter Review
Assignee: general → crowder
Status: NEW → ASSIGNED
Attached patch tiny cleanupSplinter Review
diff -b next.
Attachment #308464 - Attachment is obsolete: true
The way the old else clause was written did not agree with house style, so I cleaned it up, but the majority of the changes are whitespace.
The spacing still looks wrong, and not just the spacing -- how about a plain (space-sensitive, I mean, but still cvs diff -pu8) diff?

More important, and I've said this before, exponentiation explodes, so we should switch to linear growth at some threshold. Doubling is wrong at large size.

/be
Brendan:  You don't like the diff from comment #73, as a diff?

This needs more testing and it's buggy, too (makes the assumption that power-of-2 alloc happened on creation of the first string, which is - I think - false), but worse it doesn't seem to yield much of a bonus in perf-tests, if anything.  I never sought review on this patch for that reason.
With jemalloc we definitely don't want to second-guess realloc's underlying size classification. This patch is not the way, so let's put energy into other avenues.

/be
Yep, agreed, which is why I abandoned work on it after trying the easy hack.  I think perhaps we should WONTFIX this ancient bug (the summary is sort of bogus anyway).
We should not WONTFIX this bug without re-running relevant benchmarks from it, noting the performance, and adjusting the summary if the bug should stay open. Just because one possible patch is invalid does not mean the bug should be marked WONTFIX -- that would be a bug that's a real malfunction by some measure, but that we believe does not matter, or is worth the costs for other reasons.

/be
Throwing this one back in the pool.
Assignee: crowder → general
Status: ASSIGNED → NEW
Priority: P1 → --
Target Milestone: mozilla1.9alpha8 → ---
(We get clobbered by IE9 on this: we're bimodal at either 70 or 250ms, and they are sub-1ms.  Seems like they might be doing some analysis!)
A constant propagation that works across branches/joins would pick this up.  Bug 608346 does this, but doesn't constant fold string concatenation (easy to fix).  How does the comparison look when the strings are not constant?
If I make the function look like

var str1='qwertyuiop...';

function test()
{
  var str2='qwertyu...';
  var str;
                                        
  for(var i=0; i<=UBOUND; i++)
    str = str1 + str2;

  return str;
}

IE9 looks like 0-2 ms.  If I make it return str[40], it goes to about 2-4ms.

If I make the inner loop be
  str = str1[i % str1.length] + str2;
in addition to the other changes, IE9 is about 5-9ms, and we cut down to 30 -- because we're copying half as much string data, if I had to guess.

We may have to accept that they just have fast strings. :-)
Our problem here is that even though we make a rope when doing str1 + str2, we also have to allocate a ballast to hold the flattened string on every such concatenation, which is 4KB for these strings.  If I comment out the code allocating this ballast my time on a microbenchmark adding these qwerty strings goes from 40ms to .5ms.  Blocking on bug 608776, which should fix this.
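The rope idea referred to above can be illustrated with a toy model (this is not SpiderMonkey's actual JSString layout, just the concept): concatenation merely records the two halves, and the single real copy is deferred until someone asks for the flat characters.

```javascript
// Toy rope: O(1) concatenation; flattening is deferred until the
// characters are actually needed. Illustrative only.
function Rope(left, right) {
  this.left = left;   // plain string or another Rope
  this.right = right;
}

Rope.prototype.flatten = function () {
  var l = this.left instanceof Rope ? this.left.flatten() : this.left;
  var r = this.right instanceof Rope ? this.right.flatten() : this.right;
  return l + r; // the one real copy happens here, not at concat time
};

function concat(a, b) {
  return new Rope(a, b); // no character copying at concat time
}
```

The "ballast" problem described above is the loss of exactly this benefit: eagerly reserving a flattened buffer on every concatenation pays the copy-sized cost even when flattening never happens.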
Depends on: 608776
This has been resolved via bug 609440, which made JS_GetStringChars() fallible, removing the requirement of maintaining a buffer alongside the rope.

The testcase from Comment 13 currently reports 3.5ms average time on tracemonkey linux-x64-opt. Closing.
Status: NEW → RESOLVED
Closed: 14 years ago
Resolution: --- → FIXED
Hi, I'm on Windows XP.
Setting UBound = 100000, running 30 times

Using latest nightly (Gecko/20110113 Firefox/4.0b10pre) I get
Average time (in ms) = 6.36

Using IE8 I get
Average time (in ms) = 92.26

And using Chrome 8.0.552.237 I get
Average time (in ms) = 2.0

So, 3x slower than Chrome, something to investigate?
Sorry to post on a closed bug.
3x slower than chrome we should investigate in a new bug...
No longer blocks: 625615