Open Bug 592786 Opened 15 years ago Updated 5 months ago

Peacekeeper test arraySplice twice as slow in Firefox compared to Chrome

Categories

(Core :: JavaScript Engine, defect)

Product:

Component:

Platform:

x86_64

All

Type:

defect

Priority:

Not set

Severity:

S3

Tracking

()

Status:

NEW

People

(Reporter: bugs, Unassigned)

References

(Blocks 1 open bug,
URL
)

Details

(Keywords: perf)

Attachments

(3 files)

proof-of-concept hack 15 years ago Boris Zbarsky [:bzbarsky] 2.03 KB, patch		Details \| Diff \| Splinter Review
Shell benchmark 15 years ago Boris Zbarsky [:bzbarsky] 616 bytes, application/x-javascript		Details
Vectorized memcpy profile 15 years ago Moh Haghighat 227.38 KB, image/png		Details

Reporter

Description

•

15 years ago

This presumably implies a slower splice operation in general, I just didn't want to jump to conclusions. Mozilla/5.0 (X11; Linux x86_64; rv:2.0b6pre) Gecko/20100901 Firefox/4.0b6pre vs Google Chrome 7 for Linux Chrome7 FF nightly arraySplice |7142857.145 | 191278.406 The peacekeeper test was from the harness, which appears to create a new array, push 100,000 elements in a loop, then call .splice(20, 20); Commenting out the splice at bz's suggestion yielded almost identical results in Chrome and FF - so presumably that eliminates the array creation and initialisation. Once again filing in bmo due to the bug in bmo causing bugs filed against Core in a non-JS browser (all I have access to right now) to error with: You must select/enter a blocking1.9.1. Please press Back and try again. This should go in Core/Javascript Engine most likely

:Gavin Sharp [email: gavin@gavinsharp.com]

Updated

•

15 years ago

Assignee: marcia → general

Component: Bugzilla: Keywords & Components → JavaScript Engine

Product: mozilla.org → Core

QA Contact: timeless → general

Version: other → Trunk

Reed Loden [:reed]

Comment 1

•

15 years ago

So, why are you filing a JS bug here?

Mike Shaver (:shaver emeritus)

Comment 2

•

15 years ago

Because with JS not working in his browser, it's the only place he can file, per his second-last paragraph. Nemo should probably get a backup browser, though. :-)

Reporter

Comment 3

•

15 years ago

Degrading gracefully is not hard, but if the BMO people insist on not fixing this, I'll just defer filing these until I'm at my machine where I can do so...

Boris Zbarsky [:bzbarsky]

Comment 4

•

15 years ago

OK, so 96% of the time is in array_splice, and mostly in the loop that reimplements memmove. Is there a reason to NOT use memmove here?

Luke Wagner [:luke]

Comment 5

•

15 years ago

Hah, that loop used to have a special case for holes (to maintain some terrible property of arrays that Andreas has since removed). Yes, by all means, let's memmove! Side note: even with memmove, v8 has 32-bit values, so we will be moving 2x more bytes.

Boris Zbarsky [:bzbarsky]

Comment 6

•

15 years ago

If I try that, at least on Mac, then the performance gets _worse_ by a factor of 2 or so, and now 98% of the time is spent in __memcpy under array_splice.

Boris Zbarsky [:bzbarsky]

Comment 7

•

15 years ago

So yeah... for the actual copying, we're 2x slower than V8, almost exactly (as long as we avoid calling memmove). The way the benchmark is structured, that can translate into arbitrarily large differences in the score, depending on the hardware.

Boris Zbarsky [:bzbarsky]

Comment 8

•

15 years ago

Attached patch proof-of-concept hack — Details — Splinter Review

On my machine this doubles our score on this benchmark... but we still don't end up anywhere close to Chrome's number. I still claim this is a bug in the benchmark, but their op-count-limiter might affect things.

Boris Zbarsky [:bzbarsky]

Comment 9

•

15 years ago

Oh, peacekeeper disables the op limit in non-IE browsers! In which case, the basic structure of this test is that we run 5000 iterations at some speed and however many we can manage after that at ridonkulous speed (e.g. running for two seconds gives me 140000 iterations; 96% of the time is spent in the 5000 at the beginning. So doubling the score just meant that we made the "real work" part of this a few percent faster. If we doubled the speed of the "real work" part, our score would go up by another factor of 12 or so...

Andreas Gal :gal

Comment 10

•

15 years ago

Great benchmark.

Boris Zbarsky [:bzbarsky]

Comment 11

•

15 years ago

Note: with that last patch the assembly mac gcc generates is: 0x3400e6 movapd xmm0, [qword ptr [edx+eax*8] Loop start[5] 0x3400eb movapd [qword ptr [ecx+eax*8], xmm0 0x3400f0 add eax, byte 2 0x3400f3 cmp esi, eax 0x3400f5 ja 0x003400e6 Loop end[5] but this is very sensitive... I tried rearranging the loop to just be a |while (i)| loop, and suddenly gcc loop-induced and generated this assembly: 0x3400d9 movapd xmm0, [esi] Loop start[5] 0x3400dd movapd [ecx], xmm0 0x3400e1 add esi, byte 16 0x3400e4 add ecx, byte 16 0x3400e7 sub edi, byte 2 0x3400ea jnz 0x003400d9 Loop end[5] Which happens to be slower, in my measurements; presumably the two adds are slower than the cmp.... or something. Of course who knows what a different gcc version, or msvc, would do with this. _why_ does memmove have to suck? :(

Brendan Eich [:brendan]

Comment 12

•

15 years ago

memmove doesn't have to suck, it just has to sort out the overlap vs. non-overlap and direction cases and Duff-device/brewery-tour its way around the runt pieces before aligning for larger units. Anyone disassemble it on Mac to see what it is doing? /be

Boris Zbarsky [:bzbarsky]

Comment 13

•

15 years ago

I tried that; I can't seem to step into it in gdb in an opt build, and shark claims the addresses are all confused... If its disassembly is still correct, which may be, though the blaming of samples is clearly off, looks like there's some |rep movsw| in there, but each time it's called is followed by a jmp to what looks like a loop copying through al or something. There's also, though I dunno how control gets there, some movdqa stuff (using 4 or more XMM registers; when I was testing this I found that using two gave worse performance than the patch I attached!)....

Boris Zbarsky [:bzbarsky]

Comment 14

•

15 years ago

One other note: if we had int-specialized arrays, we'd just be moving less data and I suspect would be equal to V8's speed here.

Boris Zbarsky [:bzbarsky]

Comment 15

•

15 years ago

OK, I did some more measurements. The first time after pageload that I run this test in V8 it runs in about the time it takes us to run it. A profile shows that most of their time is spent in memmove there. The second and subsequent times I run the test without reloading I get "ludicrous speed" numbers in V8. I'd really like to understand what's going on there.

Brendan Eich [:brendan]

Comment 16

•

15 years ago

V8 is open source -- http://code.google.com/p/v8/. Are they memoizing results, or something like that? /be

Boris Zbarsky [:bzbarsky]

Comment 17

•

15 years ago

> V8 is open source -- http://code.google.com/p/v8/ Yes, and I've been looking at the source, but in the source they seem to just copy, no magic...

Boris Zbarsky [:bzbarsky]

Comment 18

•

15 years ago

Oh, and I tried the attached patch inside the harness, instead of on my standalone testcase, and it seems to speed up the in-harness array-splice test by almost 2x. But doesn't change the overall array score much....

David Mandelin [:dmandelin]

Comment 19

•

15 years ago

How about making a shell version of the benchmark and then debugging/instrumenting a V8 shell to see if they do in fact run the memmove every time?

Boris Zbarsky [:bzbarsky]

Comment 20

•

15 years ago

More precisely, it makes the array-splice test, run on its own using the harness, 2x faster. I don't know what happens when it's run as part of the overall array tests.

Boris Zbarsky [:bzbarsky]

Comment 21

•

15 years ago

> How about making a shell version of the benchmark I was actually in the middle of that, yes. Nemo claims that old versions of chrome don't have the weird-fast behavior, so I might just be able to bisect for when it appeared.

Comment 22

•

15 years ago

bz, Is it possible that in your standalone test, the condition that guards the SSE2 code (line 2402 of your patch) "offset == (unsigned(dstbeg) & 15)" is satisfied while in the overall test, it is not? For example, the offsets could be off by 8. It would be helpful to post your standalone test. - moh

Boris Zbarsky [:bzbarsky]

Comment 23

•

15 years ago

Attached file Shell benchmark — Details

TM output for this benchmark without attached patch: 361 -- 277008 364 -- 274725 365 -- 273972 363 -- 275482 364 -- 274725 377 -- 265251 With attached patch: 237 -- 421940 248 -- 403225 258 -- 387596 256 -- 390625 260 -- 384615 253 -- 395256 V8 shell output: 114 -- 877192 113 -- 884955 96 -- 1041666 3 -- 33333333 3 -- 33333333 1 -- 100000000

Boris Zbarsky [:bzbarsky]

Comment 24

•

15 years ago

Moh, the overall test is removing 20 Values from an array in this subtest, just like my standalone benchmark. Values are 8 bytes. So srcbeg and dstbeg will be 160 bytes apart, and hence in this test this condition is always true. In any case, the issue there was that I was misreading the benchmark. "Combined" doesn't mean all the array tests together; it's just a different test. So this patch really does speed up splice quite a bit. Now looking into what the deal is with v8's shell.

Boris Zbarsky [:bzbarsky]

Comment 25

•

15 years ago

Aha! I've found the relevant code: http://code.google.com/p/v8/source/browse/trunk/src/builtins.cc?r=5388#650 And the code of interest is at http://code.google.com/p/v8/source/browse/trunk/src/builtins.cc?r=5388#731 When v8 is really fast, trim_array ends up true (e.g. actual_start == 20, item_count == 0, len == 100000, actual_delete_count == 20). And in that case what v8 does is to memmove the first 20 elements into the gap, call LeftTrimFixedArray(elms, 20) and then array->set_elemnts(elms). I would assume that times when v8 is slow the !Heap::lo_space()->Contains(elms) test fails.

Boris Zbarsky [:bzbarsky]

Comment 26

•

15 years ago

An it looks like the details of that have to do with their heap and gc setup, but the upshot is that sometimes they can left-trim (and that's super-fast in this case), and sometimes they can't.

Comment 27

•

15 years ago

Attached image Vectorized memcpy profile — Details

Confirming that SSE2 vectorized memcpy is used through memmove by V8 on the test case in the patch on Windows. __VEC_memcpy is part of the libc[mt].

Boris Zbarsky [:bzbarsky]

Comment 28

•

15 years ago

Yeah, to be clear I was comparing, profiling, etc on mac. And the memmove I was hitting doesn't seem to be vectorized.... We should compare this patch to the old code to memmove on at least Windows and Linux too, before making any changes to the copying.

Boris Zbarsky [:bzbarsky]

Updated

•

15 years ago

Blocks: 609835

Boris Zbarsky [:bzbarsky]

Updated

•

15 years ago

No longer blocks: 609835

Comment 29

•

15 years ago

Not sure if 'slice' (not splice) also falls under this bug's scope or if I should file a new bug...same issue with slow performance. My particular issue is I would like to take advantage of the new JS native typed arrays (used in conjunction with buffering downloaded data from sockets and en/decrypting for SSH) but they're too slow compared to using just plain old strings. On my machine (mac, 2Ghz, dual core): array concatenation (1,000x, ~16kb data): 744 ms string concatenation (1,000x, ~16kb data): 86 ms array concatenation (100x, growing by ~16kb data at a time): 1632 ms string concatenation (100x, growing by ~16kb data): 0 ms array slice (1,000x, ~16kb data): 341 ms string substring (1,000x, ~16kb data): 1 ms I have a rudimentary benchmark here (click the Run Benchmarks button): http://firessh.mozdev.org/developers.html

Nicholas Nethercote [inactive]

Comment 30

•

15 years ago

(In reply to comment #29) > > My particular issue is I would like to take advantage of the new JS native > typed arrays (used in conjunction with buffering downloaded data from sockets > and en/decrypting for SSH) but they're too slow compared to using just plain > old strings. String concatenation is done lazily. So if you want to measure the speed of string concatenation realistically you should probably do some stuff with the string afterwards, otherwise you'll get over-optimistic results.

Comment 31

•

15 years ago

Ok, I've updated the benchmark to do a simple c.charAt(c.length - 1) after each concatenation. This adds ~100ms to the string results but it's still wildly faster than arrays. Not to compare it to Chrome too much, but on Chrome the numbers are much more reasonable.

Boris Zbarsky [:bzbarsky]

Comment 32

•

15 years ago

The slice() test could in fact be this same issue, depending on how the slice() call is done. Please file a separate bug on whatever the typed array vs string issue is, with a clear testcase, and cc me? Typed arrays and native JS arrays are _very_ different beasts, so this bug has nothing to do with typed arrays.

Comment 33

•

15 years ago

anyone want to take this?

Comment 34

•

15 years ago

(In reply to comment #32) > Please file a separate bug on whatever the typed array vs string issue is, with > a clear testcase, and cc me? Typed arrays and native JS arrays are _very_ > different beasts, so this bug has nothing to do with typed arrays. Ah, to clarify - I would *like* to use typed arrays to take advantage of its speed (and to avoid large conversions of 8-bit numbers <-> strings) but b/c of the speed issue it's faster to use string as arrays. My little benchmark doesn't involved typed arrays actually. I'm just comparing regular JS arrays. i.e. var foo = [];

Boris Zbarsky [:bzbarsky]

Comment 35

•

15 years ago

> but b/c of the speed issue it's faster to use string as arrays. And my point was that whatever speed issue you're seeing there you should file a bug on, with an example showing the problem.

Comment 36

•

15 years ago

bug 639979 opened as an offshoot to this bug.

Nobody; OK to take it and work on it

Assignee

Updated

•

12 years ago

Assignee: general → nobody

Wayne Mery (:wsmwk)

Updated

•

9 years ago

Keywords: perf

OS: Linux → All

Summary: Peacekeeper test arraySplice 37 times slower in Firefox → Peacekeeper test arraySplice 37 times slower in Firefox compared to Chrome

Updated

•

3 years ago

Severity: normal → S3

Comment 38

•

1 year ago

•

On the testcase in comment #23:

Nightly: 26ms-32ms
Chrome: 18ms

So about 50% slower now.

Reporter

Comment 39

•

5 months ago

Noticed this bug from twelve years ago in my open bug list. Updated title based on comment #38. Haven't retested.

Summary: Peacekeeper test arraySplice 37 times slower in Firefox compared to Chrome → Peacekeeper test arraySplice twice as slow in Firefox compared to Chrome

You need to log in before you can comment on or make changes to this bug.