Open
Bug 592786
Opened 15 years ago
Updated 3 months ago
Peacekeeper test arraySplice twice as slow in Firefox compared to Chrome
Categories
(Core :: JavaScript Engine, defect)
Tracking
()
NEW
People
(Reporter: bugs, Unassigned)
References
(Blocks 1 open bug, )
Details
(Keywords: perf)
Attachments
(3 files)
|
2.03 KB,
patch
|
Details | Diff | Splinter Review | |
|
616 bytes,
application/x-javascript
|
Details | |
|
227.38 KB,
image/png
|
Details |
This presumably implies a slower splice operation in general, I just didn't want to jump to conclusions.
Mozilla/5.0 (X11; Linux x86_64; rv:2.0b6pre) Gecko/20100901 Firefox/4.0b6pre
vs
Google Chrome 7 for Linux
Chrome7 FF nightly
arraySplice |7142857.145 | 191278.406
The peacekeeper test was from the harness, which appears to create a new array, push 100,000 elements in a loop, then call .splice(20, 20);
Commenting out the splice at bz's suggestion yielded almost identical results in Chrome and FF - so presumably that eliminates the array creation and initialisation.
Once again filing in bmo due to the bug in bmo causing bugs filed against Core in a non-JS browser (all I have access to right now) to error with:
You must select/enter a blocking1.9.1.
Please press Back and try again.
This should go in Core/Javascript Engine most likely
Updated•15 years ago
|
Assignee: marcia → general
Component: Bugzilla: Keywords & Components → JavaScript Engine
Product: mozilla.org → Core
QA Contact: timeless → general
Version: other → Trunk
Comment 1•15 years ago
|
||
So, why are you filing a JS bug here?
Comment 2•15 years ago
|
||
Because with JS not working in his browser, it's the only place he can file, per his second-last paragraph. Nemo should probably get a backup browser, though. :-)
Degrading gracefully is not hard, but if the BMO people insist on not fixing this, I'll just defer filing these until I'm at my machine where I can do so...
Comment 4•15 years ago
|
||
OK, so 96% of the time is in array_splice, and mostly in the loop that reimplements memmove. Is there a reason to NOT use memmove here?
Comment 5•15 years ago
|
||
Hah, that loop used to have a special case for holes (to maintain some terrible property of arrays that Andreas has since removed). Yes, by all means, let's memmove!
Side note: even with memmove, v8 has 32-bit values, so we will be moving 2x more bytes.
Comment 6•15 years ago
|
||
If I try that, at least on Mac, then the performance gets _worse_ by a factor of 2 or so, and now 98% of the time is spent in __memcpy under array_splice.
Comment 7•15 years ago
|
||
So yeah... for the actual copying, we're 2x slower than V8, almost exactly (as long as we avoid calling memmove).
The way the benchmark is structured, that can translate into arbitrarily large differences in the score, depending on the hardware.
Comment 8•15 years ago
|
||
On my machine this doubles our score on this benchmark... but we still don't end up anywhere close to Chrome's number. I still claim this is a bug in the benchmark, but their op-count-limiter might affect things.
Comment 9•15 years ago
|
||
Oh, peacekeeper disables the op limit in non-IE browsers!
In which case, the basic structure of this test is that we run 5000 iterations at some speed and however many we can manage after that at ridonkulous speed (e.g. running for two seconds gives me 140000 iterations; 96% of the time is spent in the 5000 at the beginning. So doubling the score just meant that we made the "real work" part of this a few percent faster. If we doubled the speed of the "real work" part, our score would go up by another factor of 12 or so...
Comment 10•15 years ago
|
||
Great benchmark.
Comment 11•15 years ago
|
||
Note: with that last patch the assembly mac gcc generates is:
0x3400e6 movapd xmm0, [qword ptr [edx+eax*8] Loop start[5]
0x3400eb movapd [qword ptr [ecx+eax*8], xmm0
0x3400f0 add eax, byte 2
0x3400f3 cmp esi, eax
0x3400f5 ja 0x003400e6 Loop end[5]
but this is very sensitive... I tried rearranging the loop to just be a |while (i)| loop, and suddenly gcc loop-induced and generated this assembly:
0x3400d9 movapd xmm0, [esi] Loop start[5]
0x3400dd movapd [ecx], xmm0
0x3400e1 add esi, byte 16
0x3400e4 add ecx, byte 16
0x3400e7 sub edi, byte 2
0x3400ea jnz 0x003400d9 Loop end[5]
Which happens to be slower, in my measurements; presumably the two adds are slower than the cmp.... or something.
Of course who knows what a different gcc version, or msvc, would do with this.
_why_ does memmove have to suck? :(
Comment 12•15 years ago
|
||
memmove doesn't have to suck, it just has to sort out the overlap vs. non-overlap and direction cases and Duff-device/brewery-tour its way around the runt pieces before aligning for larger units. Anyone disassemble it on Mac to see what it is doing?
/be
Comment 13•15 years ago
|
||
I tried that; I can't seem to step into it in gdb in an opt build, and shark claims the addresses are all confused...
If its disassembly is still correct, which may be, though the blaming of samples is clearly off, looks like there's some |rep movsw| in there, but each time it's called is followed by a jmp to what looks like a loop copying through al or something.
There's also, though I dunno how control gets there, some movdqa stuff (using 4 or more XMM registers; when I was testing this I found that using two gave worse performance than the patch I attached!)....
Comment 14•15 years ago
|
||
One other note: if we had int-specialized arrays, we'd just be moving less data and I suspect would be equal to V8's speed here.
Comment 15•15 years ago
|
||
OK, I did some more measurements. The first time after pageload that I run this test in V8 it runs in about the time it takes us to run it. A profile shows that most of their time is spent in memmove there.
The second and subsequent times I run the test without reloading I get "ludicrous speed" numbers in V8. I'd really like to understand what's going on there.
Comment 16•15 years ago
|
||
V8 is open source -- http://code.google.com/p/v8/. Are they memoizing results, or something like that?
/be
Comment 17•15 years ago
|
||
> V8 is open source -- http://code.google.com/p/v8/
Yes, and I've been looking at the source, but in the source they seem to just copy, no magic...
Comment 18•15 years ago
|
||
Oh, and I tried the attached patch inside the harness, instead of on my standalone testcase, and it seems to speed up the in-harness array-splice test by almost 2x. But doesn't change the overall array score much....
Comment 19•15 years ago
|
||
How about making a shell version of the benchmark and then debugging/instrumenting a V8 shell to see if they do in fact run the memmove every time?
Comment 20•15 years ago
|
||
More precisely, it makes the array-splice test, run on its own using the harness, 2x faster. I don't know what happens when it's run as part of the overall array tests.
Comment 21•15 years ago
|
||
> How about making a shell version of the benchmark
I was actually in the middle of that, yes. Nemo claims that old versions of chrome don't have the weird-fast behavior, so I might just be able to bisect for when it appeared.
Comment 22•15 years ago
|
||
bz,
Is it possible that in your standalone test, the condition that guards the SSE2 code (line 2402 of your patch) "offset == (unsigned(dstbeg) & 15)" is satisfied while in the overall test, it is not? For example, the offsets could be off by 8.
It would be helpful to post your standalone test.
- moh
Comment 23•15 years ago
|
||
TM output for this benchmark without attached patch:
361 -- 277008
364 -- 274725
365 -- 273972
363 -- 275482
364 -- 274725
377 -- 265251
With attached patch:
237 -- 421940
248 -- 403225
258 -- 387596
256 -- 390625
260 -- 384615
253 -- 395256
V8 shell output:
114 -- 877192
113 -- 884955
96 -- 1041666
3 -- 33333333
3 -- 33333333
1 -- 100000000
Comment 24•15 years ago
|
||
Moh, the overall test is removing 20 Values from an array in this subtest, just like my standalone benchmark. Values are 8 bytes. So srcbeg and dstbeg will be 160 bytes apart, and hence in this test this condition is always true.
In any case, the issue there was that I was misreading the benchmark. "Combined" doesn't mean all the array tests together; it's just a different test. So this patch really does speed up splice quite a bit.
Now looking into what the deal is with v8's shell.
Comment 25•15 years ago
|
||
Aha! I've found the relevant code: http://code.google.com/p/v8/source/browse/trunk/src/builtins.cc?r=5388#650
And the code of interest is at http://code.google.com/p/v8/source/browse/trunk/src/builtins.cc?r=5388#731
When v8 is really fast, trim_array ends up true (e.g. actual_start == 20, item_count == 0, len == 100000, actual_delete_count == 20). And in that case what v8 does is to memmove the first 20 elements into the gap, call LeftTrimFixedArray(elms, 20) and then array->set_elemnts(elms).
I would assume that times when v8 is slow the !Heap::lo_space()->Contains(elms) test fails.
Comment 26•15 years ago
|
||
An it looks like the details of that have to do with their heap and gc setup, but the upshot is that sometimes they can left-trim (and that's super-fast in this case), and sometimes they can't.
Comment 27•15 years ago
|
||
Confirming that SSE2 vectorized memcpy is used through memmove by V8 on the test case in the patch on Windows. __VEC_memcpy is part of the libc[mt].
Comment 28•15 years ago
|
||
Yeah, to be clear I was comparing, profiling, etc on mac. And the memmove I was hitting doesn't seem to be vectorized....
We should compare this patch to the old code to memmove on at least Windows and Linux too, before making any changes to the copying.
Comment 29•15 years ago
|
||
Not sure if 'slice' (not splice) also falls under this bug's scope or if I should file a new bug...same issue with slow performance.
My particular issue is I would like to take advantage of the new JS native typed arrays (used in conjunction with buffering downloaded data from sockets and en/decrypting for SSH) but they're too slow compared to using just plain old strings.
On my machine (mac, 2Ghz, dual core):
array concatenation (1,000x, ~16kb data): 744 ms
string concatenation (1,000x, ~16kb data): 86 ms
array concatenation (100x, growing by ~16kb data at a time): 1632 ms
string concatenation (100x, growing by ~16kb data): 0 ms
array slice (1,000x, ~16kb data): 341 ms
string substring (1,000x, ~16kb data): 1 ms
I have a rudimentary benchmark here (click the Run Benchmarks button): http://firessh.mozdev.org/developers.html
Comment 30•15 years ago
|
||
(In reply to comment #29)
>
> My particular issue is I would like to take advantage of the new JS native
> typed arrays (used in conjunction with buffering downloaded data from sockets
> and en/decrypting for SSH) but they're too slow compared to using just plain
> old strings.
String concatenation is done lazily. So if you want to measure the speed of string concatenation realistically you should probably do some stuff with the string afterwards, otherwise you'll get over-optimistic results.
Comment 31•15 years ago
|
||
Ok, I've updated the benchmark to do a simple c.charAt(c.length - 1) after each concatenation. This adds ~100ms to the string results but it's still wildly faster than arrays.
Not to compare it to Chrome too much, but on Chrome the numbers are much more reasonable.
Comment 32•15 years ago
|
||
The slice() test could in fact be this same issue, depending on how the slice() call is done.
Please file a separate bug on whatever the typed array vs string issue is, with a clear testcase, and cc me? Typed arrays and native JS arrays are _very_ different beasts, so this bug has nothing to do with typed arrays.
Comment 33•15 years ago
|
||
anyone want to take this?
Comment 34•15 years ago
|
||
(In reply to comment #32)
> Please file a separate bug on whatever the typed array vs string issue is, with
> a clear testcase, and cc me? Typed arrays and native JS arrays are _very_
> different beasts, so this bug has nothing to do with typed arrays.
Ah, to clarify - I would *like* to use typed arrays to take advantage of its speed (and to avoid large conversions of 8-bit numbers <-> strings) but b/c of the speed issue it's faster to use string as arrays. My little benchmark doesn't involved typed arrays actually. I'm just comparing regular JS arrays. i.e. var foo = [];
Comment 35•15 years ago
|
||
> but b/c of the speed issue it's faster to use string as arrays.
And my point was that whatever speed issue you're seeing there you should file a bug on, with an example showing the problem.
Comment 36•15 years ago
|
||
bug 639979 opened as an offshoot to this bug.
| Assignee | ||
Updated•11 years ago
|
Assignee: general → nobody
Updated•9 years ago
|
Keywords: perf
OS: Linux → All
Summary: Peacekeeper test arraySplice 37 times slower in Firefox → Peacekeeper test arraySplice 37 times slower in Firefox compared to Chrome
Updated•3 years ago
|
Severity: normal → S3
Comment 38•1 year ago
•
|
||
| Reporter | ||
Comment 39•3 months ago
|
||
Noticed this bug from twelve years ago in my open bug list. Updated title based on comment #38. Haven't retested.
Summary: Peacekeeper test arraySplice 37 times slower in Firefox compared to Chrome → Peacekeeper test arraySplice twice as slow in Firefox compared to Chrome
You need to log in
before you can comment on or make changes to this bug.
Description
•