Open Bug 624299 Opened 14 years ago Updated 2 years ago

2x slower than v8 on recursion+scope chain testcase

Categories: Core :: JavaScript Engine, defect
People: (Reporter: bzbarsky, Unassigned)
Whiteboard: [js:t] [js:perf]
Attachments: (2 files, 2 obsolete files)

See bug 614834 comment 27.  The testcase in question is in the url field.

The ratio improved, but we're still slowest here (all numbers on my rMBP @ 2.7GHz):

SpiderMonkey:
0.58
0.56
0.565
0.56
0.5575

JSC:
0.34
0.33
0.32
0.3175
0.3175

d8:
0.3
0.29
0.295
0.3325
0.29625
OS: Mac OS X → All
Hardware: x86 → All
Summary: 4x slower than v8 on recursion+scope chain testcase → 2x slower than v8 on recursion+scope chain testcase
Whiteboard: [js:t] [js:perf]
Assignee: general → nobody
Firefox 33 is faster than Chrome 39 for me.

Firefox goes from 0.60 to 0.45 and Chrome goes from 0.70 to 0.55.

For me (same setup as in comment 1), we're still slowest (and note the progress JSC has made):

SpiderMonkey:
0.46
0.44
0.45
0.4525
0.4575

JSC:
0.22
0.2
0.21
0.2175
0.215

d8:
0.26
0.31
0.28
0.2775
0.2625

Current Nightly and Canary also reflect this. Safari is about 50% slower than JSC, but still faster than us.

This is a lot faster on 32-bit. On OS X I get 0.23-0.26 ms with an x86 build, 0.39-0.42 ms with an x64 build.

Could be our boxing format or us spilling more registers somewhere, we should investigate.
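
For illustration (a sketch assuming the usual nunbox32/punbox64 split; the struct names and field layout are made up for clarity, and the real definitions live in js/public/Value.h), the two boxing formats look roughly like this:

#include <cstdint>

// 32-bit builds ("nunbox32"): a Value is a pair of 32-bit words, with the
// type tag kept separate from the payload.  Unboxing an int32 only needs a
// compare on the tag word; the payload word is already usable as-is.
struct Nunbox32Value {
    uint32_t payload;  // the int32 itself (or a pointer, boolean, ...)
    uint32_t tag;      // full 32-bit type tag
};

// 64-bit builds ("punbox64"): a Value is a single 64-bit word with the type
// tag packed into the upper bits.  Unboxing an int32 needs extra work to
// recover the tag for the guard and to extract the low 32 payload bits.
struct Punbox64Value {
    uint64_t bits;  // tag in the high bits, int32 payload in the low 32 bits
};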
Attached file 32-bit JIT Inspector output (obsolete) —
Attachment #8527735 - Attachment is obsolete: true
Attachment #8527736 - Attachment is obsolete: true
Some thoughts in no particular order:

1)  The overall time of the testcase on 32-bit is about 0.25 * (50 + 100 + 200 + 400 + 800) = 387.5ms.  The x86-64 times are about 2x that, in the 800-900ms range.  So we need to account for about 400-500 ms of runtime.

2)  The testcase executes about 300e6 Unbox:Int32 instructions.  On x86, there's nothing to do for these if we know we have an int.  On x86-64, these correspond to a single movl.  What this means on the hardware I don't know, but if we assume each takes one cycle, that's 300e6 cycles; at 2.6GHz that's about 115ms.  Worse yet, in some of these cases we don't know we have an int.  In that case, on 32-bit we get things like:

[MoveGroup]
    movl       %edx, %eax
[Unbox:Int32]
    cmpl       $0xffffff81, %ecx
    jne        ((366))

And on 64-bit we get:

[Unbox:Int32]
    movq       %rcx, %r11
    shrq       $47, %r11
    cmpl       $0x1fff1, %r11d
    jne        ((383))
    movl       %ecx, %eax

So that's an extra move and shift, though on 32-bit presumably we paid part of that cost when we initially placed the high 32 bits of the Value in ecx.  (A rough sketch of the two guards follows below.)

3)  On x86-64 there's an extra MoveGroup before the first CallKnown.  But the actual call is cheaper, and in any case there aren't _that_ many CallKnowns here (about 75e6).

So my money is that the main culprit here is the Unbox:Int32 bits.
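
To make the guard cost concrete, here is a rough C++ model of the two Unbox:Int32 paths shown above, using the tag constants visible in the disassembly (0xffffff81 as the 32-bit tag word, 0x1fff1 as the 64-bit tag after the 47-bit shift).  This is a sketch of what the generated code does, not engine source:

#include <cstdint>

// Tag constants as they appear in the disassembly above; illustrative only.
constexpr uint32_t NUNBOX32_INT32_TAG = 0xffffff81;
constexpr uint64_t PUNBOX64_INT32_TAG = 0x1fff1;

// 32-bit guard: the tag word is already sitting in a register, so the guard
// is a single compare-and-branch (cmpl $0xffffff81, %ecx; jne <bailout>).
bool unboxInt32Nunbox32(uint32_t tag, uint32_t payload, int32_t* out) {
    if (tag != NUNBOX32_INT32_TAG)
        return false;                       // the jne, i.e. bail out
    *out = static_cast<int32_t>(payload);   // payload is directly usable
    return true;
}

// 64-bit guard: the tag first has to be recovered from the packed word
// (movq + shrq $47 + cmpl $0x1fff1), and the payload still needs a movl.
bool unboxInt32Punbox64(uint64_t bits, int32_t* out) {
    if ((bits >> 47) != PUNBOX64_INT32_TAG)
        return false;                                   // jne <bailout>
    *out = static_cast<int32_t>(bits & 0xffffffffu);    // movl %ecx, %eax
    return true;
}

Per unbox the difference is just a move and a shift, but at roughly 300e6 unboxes a few extra cycles each could plausibly account for a meaningful part of the 400-500 ms gap estimated above.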
(In reply to Please do not ask for reviews for a bit [:bz] from comment #9)
> So my money is that the main culprit here is the Unbox:Int32 bits.

Yes, I have a patch for x64 Unbox that gets us close to the 32-bit numbers. Will post soon, after testing what it does on some other benchmarks.
Depends on: 1104199
(In reply to Jan de Mooij [:jandem] from comment #10)
> Yes, I have a patch for x64 Unbox that gets us close to the 32-bit numbers.
> Will post soon, after testing what it does on some other benchmarks.

Bug 1104199. With the patch there:

x64 before: 0.44, 0.38, 0.425, 0.4,    0.4075
x64 after:  0.28, 0.27, 0.245, 0.26,   0.25125
x86:        0.24, 0.23, 0.25,  0.2475, 0.23625

d8 x64:     0.26, 0.23, 0.235, 0.2425, 0.22625
Severity: normal → S3