Open Bug 592786 Opened 15 years ago Updated 3 months ago

Peacekeeper test arraySplice twice as slow in Firefox compared to Chrome

Categories

(Core :: JavaScript Engine, defect)

x86_64
All
defect

Tracking

()

People

(Reporter: bugs, Unassigned)

References

(Blocks 1 open bug, )

Details

(Keywords: perf)

Attachments

(3 files)

This presumably implies a slower splice operation in general, I just didn't want to jump to conclusions. Mozilla/5.0 (X11; Linux x86_64; rv:2.0b6pre) Gecko/20100901 Firefox/4.0b6pre vs Google Chrome 7 for Linux Chrome7 FF nightly arraySplice |7142857.145 | 191278.406 The peacekeeper test was from the harness, which appears to create a new array, push 100,000 elements in a loop, then call .splice(20, 20); Commenting out the splice at bz's suggestion yielded almost identical results in Chrome and FF - so presumably that eliminates the array creation and initialisation. Once again filing in bmo due to the bug in bmo causing bugs filed against Core in a non-JS browser (all I have access to right now) to error with: You must select/enter a blocking1.9.1. Please press Back and try again. This should go in Core/Javascript Engine most likely
Assignee: marcia → general
Component: Bugzilla: Keywords & Components → JavaScript Engine
Product: mozilla.org → Core
QA Contact: timeless → general
Version: other → Trunk
So, why are you filing a JS bug here?
Because with JS not working in his browser, it's the only place he can file, per his second-last paragraph. Nemo should probably get a backup browser, though. :-)
Degrading gracefully is not hard, but if the BMO people insist on not fixing this, I'll just defer filing these until I'm at my machine where I can do so...
OK, so 96% of the time is in array_splice, and mostly in the loop that reimplements memmove. Is there a reason to NOT use memmove here?
Hah, that loop used to have a special case for holes (to maintain some terrible property of arrays that Andreas has since removed). Yes, by all means, let's memmove! Side note: even with memmove, v8 has 32-bit values, so we will be moving 2x more bytes.
If I try that, at least on Mac, then the performance gets _worse_ by a factor of 2 or so, and now 98% of the time is spent in __memcpy under array_splice.
So yeah... for the actual copying, we're 2x slower than V8, almost exactly (as long as we avoid calling memmove). The way the benchmark is structured, that can translate into arbitrarily large differences in the score, depending on the hardware.
On my machine this doubles our score on this benchmark... but we still don't end up anywhere close to Chrome's number. I still claim this is a bug in the benchmark, but their op-count-limiter might affect things.
Oh, peacekeeper disables the op limit in non-IE browsers! In which case, the basic structure of this test is that we run 5000 iterations at some speed and however many we can manage after that at ridonkulous speed (e.g. running for two seconds gives me 140000 iterations; 96% of the time is spent in the 5000 at the beginning. So doubling the score just meant that we made the "real work" part of this a few percent faster. If we doubled the speed of the "real work" part, our score would go up by another factor of 12 or so...
Great benchmark.
Note: with that last patch the assembly mac gcc generates is: 0x3400e6 movapd xmm0, [qword ptr [edx+eax*8] Loop start[5] 0x3400eb movapd [qword ptr [ecx+eax*8], xmm0 0x3400f0 add eax, byte 2 0x3400f3 cmp esi, eax 0x3400f5 ja 0x003400e6 Loop end[5] but this is very sensitive... I tried rearranging the loop to just be a |while (i)| loop, and suddenly gcc loop-induced and generated this assembly: 0x3400d9 movapd xmm0, [esi] Loop start[5] 0x3400dd movapd [ecx], xmm0 0x3400e1 add esi, byte 16 0x3400e4 add ecx, byte 16 0x3400e7 sub edi, byte 2 0x3400ea jnz 0x003400d9 Loop end[5] Which happens to be slower, in my measurements; presumably the two adds are slower than the cmp.... or something. Of course who knows what a different gcc version, or msvc, would do with this. _why_ does memmove have to suck? :(
memmove doesn't have to suck, it just has to sort out the overlap vs. non-overlap and direction cases and Duff-device/brewery-tour its way around the runt pieces before aligning for larger units. Anyone disassemble it on Mac to see what it is doing? /be
I tried that; I can't seem to step into it in gdb in an opt build, and shark claims the addresses are all confused... If its disassembly is still correct, which may be, though the blaming of samples is clearly off, looks like there's some |rep movsw| in there, but each time it's called is followed by a jmp to what looks like a loop copying through al or something. There's also, though I dunno how control gets there, some movdqa stuff (using 4 or more XMM registers; when I was testing this I found that using two gave worse performance than the patch I attached!)....
One other note: if we had int-specialized arrays, we'd just be moving less data and I suspect would be equal to V8's speed here.
OK, I did some more measurements. The first time after pageload that I run this test in V8 it runs in about the time it takes us to run it. A profile shows that most of their time is spent in memmove there. The second and subsequent times I run the test without reloading I get "ludicrous speed" numbers in V8. I'd really like to understand what's going on there.
V8 is open source -- http://code.google.com/p/v8/. Are they memoizing results, or something like that? /be
> V8 is open source -- http://code.google.com/p/v8/ Yes, and I've been looking at the source, but in the source they seem to just copy, no magic...
Oh, and I tried the attached patch inside the harness, instead of on my standalone testcase, and it seems to speed up the in-harness array-splice test by almost 2x. But doesn't change the overall array score much....
How about making a shell version of the benchmark and then debugging/instrumenting a V8 shell to see if they do in fact run the memmove every time?
More precisely, it makes the array-splice test, run on its own using the harness, 2x faster. I don't know what happens when it's run as part of the overall array tests.
> How about making a shell version of the benchmark I was actually in the middle of that, yes. Nemo claims that old versions of chrome don't have the weird-fast behavior, so I might just be able to bisect for when it appeared.
bz, Is it possible that in your standalone test, the condition that guards the SSE2 code (line 2402 of your patch) "offset == (unsigned(dstbeg) & 15)" is satisfied while in the overall test, it is not? For example, the offsets could be off by 8. It would be helpful to post your standalone test. - moh
Attached file Shell benchmark
TM output for this benchmark without attached patch: 361 -- 277008 364 -- 274725 365 -- 273972 363 -- 275482 364 -- 274725 377 -- 265251 With attached patch: 237 -- 421940 248 -- 403225 258 -- 387596 256 -- 390625 260 -- 384615 253 -- 395256 V8 shell output: 114 -- 877192 113 -- 884955 96 -- 1041666 3 -- 33333333 3 -- 33333333 1 -- 100000000
Moh, the overall test is removing 20 Values from an array in this subtest, just like my standalone benchmark. Values are 8 bytes. So srcbeg and dstbeg will be 160 bytes apart, and hence in this test this condition is always true. In any case, the issue there was that I was misreading the benchmark. "Combined" doesn't mean all the array tests together; it's just a different test. So this patch really does speed up splice quite a bit. Now looking into what the deal is with v8's shell.
Aha! I've found the relevant code: http://code.google.com/p/v8/source/browse/trunk/src/builtins.cc?r=5388#650 And the code of interest is at http://code.google.com/p/v8/source/browse/trunk/src/builtins.cc?r=5388#731 When v8 is really fast, trim_array ends up true (e.g. actual_start == 20, item_count == 0, len == 100000, actual_delete_count == 20). And in that case what v8 does is to memmove the first 20 elements into the gap, call LeftTrimFixedArray(elms, 20) and then array->set_elemnts(elms). I would assume that times when v8 is slow the !Heap::lo_space()->Contains(elms) test fails.
An it looks like the details of that have to do with their heap and gc setup, but the upshot is that sometimes they can left-trim (and that's super-fast in this case), and sometimes they can't.
Confirming that SSE2 vectorized memcpy is used through memmove by V8 on the test case in the patch on Windows. __VEC_memcpy is part of the libc[mt].
Yeah, to be clear I was comparing, profiling, etc on mac. And the memmove I was hitting doesn't seem to be vectorized.... We should compare this patch to the old code to memmove on at least Windows and Linux too, before making any changes to the copying.
Blocks: 609835
No longer blocks: 609835
Not sure if 'slice' (not splice) also falls under this bug's scope or if I should file a new bug...same issue with slow performance. My particular issue is I would like to take advantage of the new JS native typed arrays (used in conjunction with buffering downloaded data from sockets and en/decrypting for SSH) but they're too slow compared to using just plain old strings. On my machine (mac, 2Ghz, dual core): array concatenation (1,000x, ~16kb data): 744 ms string concatenation (1,000x, ~16kb data): 86 ms array concatenation (100x, growing by ~16kb data at a time): 1632 ms string concatenation (100x, growing by ~16kb data): 0 ms array slice (1,000x, ~16kb data): 341 ms string substring (1,000x, ~16kb data): 1 ms I have a rudimentary benchmark here (click the Run Benchmarks button): http://firessh.mozdev.org/developers.html
(In reply to comment #29) > > My particular issue is I would like to take advantage of the new JS native > typed arrays (used in conjunction with buffering downloaded data from sockets > and en/decrypting for SSH) but they're too slow compared to using just plain > old strings. String concatenation is done lazily. So if you want to measure the speed of string concatenation realistically you should probably do some stuff with the string afterwards, otherwise you'll get over-optimistic results.
Ok, I've updated the benchmark to do a simple c.charAt(c.length - 1) after each concatenation. This adds ~100ms to the string results but it's still wildly faster than arrays. Not to compare it to Chrome too much, but on Chrome the numbers are much more reasonable.
The slice() test could in fact be this same issue, depending on how the slice() call is done. Please file a separate bug on whatever the typed array vs string issue is, with a clear testcase, and cc me? Typed arrays and native JS arrays are _very_ different beasts, so this bug has nothing to do with typed arrays.
anyone want to take this?
(In reply to comment #32) > Please file a separate bug on whatever the typed array vs string issue is, with > a clear testcase, and cc me? Typed arrays and native JS arrays are _very_ > different beasts, so this bug has nothing to do with typed arrays. Ah, to clarify - I would *like* to use typed arrays to take advantage of its speed (and to avoid large conversions of 8-bit numbers <-> strings) but b/c of the speed issue it's faster to use string as arrays. My little benchmark doesn't involved typed arrays actually. I'm just comparing regular JS arrays. i.e. var foo = [];
> but b/c of the speed issue it's faster to use string as arrays. And my point was that whatever speed issue you're seeing there you should file a bug on, with an example showing the problem.
bug 639979 opened as an offshoot to this bug.
Assignee: general → nobody
Keywords: perf
OS: Linux → All
Summary: Peacekeeper test arraySplice 37 times slower in Firefox → Peacekeeper test arraySplice 37 times slower in Firefox compared to Chrome
Severity: normal → S3

On the testcase in comment #23:

Nightly: 26ms-32ms
Chrome: 18ms

So about 50% slower now.

Noticed this bug from twelve years ago in my open bug list. Updated title based on comment #38. Haven't retested.

Summary: Peacekeeper test arraySplice 37 times slower in Firefox compared to Chrome → Peacekeeper test arraySplice twice as slow in Firefox compared to Chrome
You need to log in before you can comment on or make changes to this bug.

Attachment

General

Created:
Updated:
Size: