Closed Bug 501189 Opened 15 years ago Closed 11 years ago

Understand empty loop performance

Categories

(Core :: JavaScript Engine, defect, P2)

RESOLVED FIXED

People

(Reporter: dmandelin, Assigned: gal)

Attachments

(3 files)

Attached file Test case
On the attached test case, tracing should be at least as fast as any other technique. But instead, we run in 38 ms (8-9 cycles/iteration) while WebKit runs in 30 ms (6-7 cycles/iteration). (v8 is 104 ms (23 cycles/iteration).) Why? Presumably the generated x86 has most of the answer.
I tested the test-case variant where an object is created on each iteration, i.e. ({}) instead of {}, and reduced the loop count to 1000000 because otherwise the GC kicks in and we have to recompile everything 5 times. TM runs in 117 ms (234 cycles/iteration), non-JIT in 1426 ms, WebKit in 69 ms (140 cycles/iteration), and V8 in 56 ms (112 cycles/iteration).
Shark says TM spends about 85-90% of the time in js_Object_tn. That function calls js_newGCThing, where we spend about 60% of the time; the newGCThing function called from there needs only about 5%. So we spend about 50% of the time within js_newGCThing itself, excluding called functions. These numbers are relative to the time spent in js_Execute.
Within js_newGCThing there is a very expensive division in the THINGS_PER_ARENA(nbytes) macro. According to Shark, we spend about 30% of the time waiting for the result of this division.
(In reply to comment #3)
> Within js_newGCThing there is a very expensive division in the
> THINGS_PER_ARENA(nbytes) macro. According to Shark, we spend about 30% of
> the time waiting for the result of this division.

Division!? My strength-reduction radar must have been broken when reviewing.

/be
Comment 1 and on seem to be in the wrong bug. This bug is about an empty loop. The bug about object allocation is bug 501186 -- right?

/be
Oops, my fault. I was only given a link to this bug, and Andreas wanted numbers for the object-creation loop. Anyway, the macro with the division is called 1000386 times for this test case with 1000000 iterations.
Patching.
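For reference, a minimal sketch of one way to avoid the per-allocation division, assuming the set of GC thing sizes is small and fixed. This is only an illustration, not the actual patch; the constants and names below (ARENA_PAYLOAD_SIZE, MAX_THING_SIZE, ThingsPerArena) are made up:

#include <stddef.h>

#define ARENA_PAYLOAD_SIZE 4096   /* payload bytes per GC arena (assumed) */
#define MAX_THING_SIZE     64     /* upper bound on GC thing size (assumed) */

static size_t things_per_arena_table[MAX_THING_SIZE + 1];

/* Run once at startup: pay for the divisions here instead of on every
   allocation. */
static void
InitThingsPerArenaTable(void)
{
    size_t nbytes;
    for (nbytes = 1; nbytes <= MAX_THING_SIZE; nbytes++)
        things_per_arena_table[nbytes] = ARENA_PAYLOAD_SIZE / nbytes;
}

/* Hot-path replacement for THINGS_PER_ARENA(nbytes): a table load instead
   of a divide. */
static inline size_t
ThingsPerArena(size_t nbytes)
{
    return things_per_arena_table[nbytes];
}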
There is also another division in the same function:

nbytes = JS_ROUNDUP(nbytes, sizeof(JSGCThing));

which goes to

#define JS_HOWMANY(x,y) (((x)+(y)-1)/(y))
#define JS_ROUNDUP(x,y) (JS_HOWMANY(x,y)*(y))
You are still posting in the wrong bug. Let's move this to https://bugzilla.mozilla.org/show_bug.cgi?id=501186
(In reply to comment #8)
> There is also another division in the same function:
> nbytes = JS_ROUNDUP(nbytes, sizeof(JSGCThing));
>
> which goes to
>
> #define JS_HOWMANY(x,y) (((x)+(y)-1)/(y))
> #define JS_ROUNDUP(x,y) (JS_HOWMANY(x,y)*(y))

Gregor: parting shot in this bug -- that macro with unsigned power of two y (sizeof(JSGCThing) is 8 or 16 and size_t) should not compile to division. Do you know for sure that a compiler is failing to reduce strength?

/be
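As a sanity check on the strength-reduction question, here is a small standalone C snippet, assuming y is a compile-time power-of-two constant (as sizeof(JSGCThing) is). An optimizing compiler should turn both functions into the same add-and-mask sequence with no division instruction; if a div shows up for the first one, the compiler really is failing to reduce strength. The names are only for illustration:

#include <stddef.h>

#define JS_HOWMANY(x,y)   (((x)+(y)-1)/(y))
#define JS_ROUNDUP(x,y)   (JS_HOWMANY(x,y)*(y))

/* Explicit mask form, equivalent to JS_ROUNDUP when y is a power of two. */
#define ROUNDUP_POW2(x,y) (((x) + (y) - 1) & ~((size_t)(y) - 1))

size_t roundup_div(size_t nbytes)  { return JS_ROUNDUP(nbytes, 16); }
size_t roundup_mask(size_t nbytes) { return ROUNDUP_POW2(nbytes, 16); }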
Attached file Empty loop trace asm
I've attached a human-readable assembly representation of the trace generated for the empty loop test. I found that I can increase the speed by 2x or more by removing unneeded code.

trace                                  run time (ms)
------------------------------------   -------------
baseline from js -j                          35
remove SB1                                   27
remove SB2                                   25
remove SB3                                   31
remove SB1,SB2                               25
remove SB1,SB3                               31
remove SB2,SB3                               31
remove SB* (all 3)                           25
hoist sp                                     38
remove SB*, hoist sp                         21
remove SB*, hoist sp,i                       18
remove SB* & last store, hoist sp,i          18
all above + remove oflow guard               14

At least on this test, the on-trace store-backs are hurting us a lot: removing them is a 35/25 = 1.4x speedup. Hoisting doesn't make sense without removing the store-backs first, but if we do both, we are up to a 2x speedup. Removing the overflow guard gets us up to a 2.5x speedup. Doing that is hard in general, but it should not be too hard for loops with an integer iterator variable that is only set in the loop increment and is bounded by a constant.
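To make the store-back/hoisting point concrete, here is a rough C analogue of what the table suggests; it is only a sketch of the idea, not the trace code itself, and InterpState and its field are invented for illustration:

#include <stdint.h>

struct InterpState { int32_t i; };   /* stand-in for the interpreter state */

/* Baseline analogue: the loop counter goes through memory on every
   iteration, like the on-trace store-backs. The volatile qualifier forces
   the load/store each time around. */
void
loop_with_storebacks(volatile struct InterpState *state)
{
    for (state->i = 0; state->i < 10000000; state->i++)
        ;   /* empty body */
}

/* Hoisted analogue: the counter lives in a register and is stored back once
   at loop exit. (An optimizer may collapse this loop entirely; the point is
   only the single store-back.) */
void
loop_hoisted(struct InterpState *state)
{
    int32_t i;
    for (i = 0; i < 10000000; i++)
        ;
    state->i = i;
}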
Flags: blocking1.9.2+
Priority: -- → P2
-'ing this as dmandelin and gal say it shouldn't block. Sayre, feel free to override.
Assignee: general → gal
Flags: blocking1.9.2+ → blocking1.9.2-
Current JS shell numbers for the attached testcase:

Interp:  1014 ms, 223.08 cycles/obj
-j:      1065 ms, 234.30 cycles/obj
-m:        54 ms,  11.88 cycles/obj
-m -n:     26 ms,   5.72 cycles/obj

JM+TI seems to be ahead of the WebKit numbers given in comment #0. I don't have a d8 Windows binary to test. Looking at comment #13, it appears that there's still more room for improvement, though?
Blocks: 467263
Summary: TM: Understand empty loop performance → Understand empty loop performance
If V8 is using process isolation only, not an interrupt-checking load on the loop edge, then it will win on empty-loop benchmarks. The question is: what is the cost of that load on minimal but non-empty loops?

/be
I get these numbers:

SM : 18 ms
d8 : 26 ms

Ion emits this for an empty loop in the global scope:

[Codegen] cmpl       $0x0, 0x0x2a12440   // interrupt check
[Codegen] jne        ((513))
[Codegen] movl       0x758(%eax), %ecx
[Codegen] cmpl       $0x989680, %ecx
[Codegen] jge        ((531))
[Codegen] addl       $0x1, %ecx
[Codegen] movl       %ecx, 0x758(%eax)
[Codegen] jmp        ((545))

Same loop inside a function (local instead of global var):

[Codegen] cmpl       $0x0, 0x0x29cc240   // interrupt check
[Codegen] jne        ((159))
[Codegen] cmpl       $0x989680, %eax
[Codegen] jge        ((171))
[Codegen] addl       $0x1, %eax

We can use asm.js-style mprotect to eliminate the interrupt check (bug 864220), but we beat V8 now and I don't think we have to keep this bug open.
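For readers not familiar with the pattern, here is a tiny C sketch of what the explicit loop-edge interrupt check amounts to. The names are hypothetical, and the mprotect/signal variant from bug 864220 is only described in a comment, not implemented:

#include <stdint.h>

static volatile uint32_t interrupt_pending;   /* set asynchronously by the engine */

static void
HandleInterrupt(void)
{
    /* stand-in for the engine's operation-callback slow path */
}

void
loop_with_interrupt_check(int32_t n)
{
    int32_t i;
    for (i = 0; i < n; i++) {
        /* This load + compare + branch is the "cmpl $0x0, ...; jne" pair on
           the loop edge in the Ion output above. The asm.js-style scheme
           instead protects a page with mprotect() and lets the resulting
           fault deliver the interrupt, so the loop needs no explicit check. */
        if (interrupt_pending)
            HandleInterrupt();
    }
}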
Status: NEW → RESOLVED
Closed: 11 years ago
Resolution: --- → FIXED