Closed Bug 578912 Opened 11 years ago Closed 11 years ago

JM: MICs for fastnative calls

Categories

(Core :: JavaScript Engine, defect)

defect
Not set
normal

Tracking

()

RESOLVED FIXED

People

(Reporter: sstangl, Assigned: bhackett1024)

References

Details

Attachments

(1 file, 4 obsolete files)

Certain native functions, especially Math natives, can be fast-pathed. Redefining these functions can be supported by a PIC at the callsite, one path of which goes to a fast-path.

As discussed with dvander, this requires creating "a new kind of PIC".

In the benchmarks we run against, this involves Math.pow(), Math.sqrt(), Math.sin(). These functions are called often in loops in math-heavy code, especially in v8's Raytracer, so the speedup should be worthwhile. Let's say 15ms to be conservative.
What is your estimate based on? The math functions themselves are pretty expensive, so the call overhead might not matter as much. I heard jsc is caching the actual results. That might be a more worth-while optimization (for benchmarking purposes, for actual web content thats of course borderline silly).
I had previously measured 250,000 calls to Math.sin() in SS; both SS and v8 contain many calls to Math functions, often in loops, so 15ms across both accounting for every Math operation does not sound unreasonable to me. I also am reminded that I operate in a (potentially useful) dimensional warp with six-year-old hardware.

Caching results of mathematical functions sounds very silly to me indeed, at least for things like sqrt(), which are part of SSE2.
Let's morph this, we need PICs to do it so it can be part of that work.
Blocks: JaegerCalls
No longer blocks: JaegerSpeed
Summary: JM: Add fast-paths for native functions. → JM: ICs for function calls
Back in the good old days of Pentium4 a sqrt in SSE2 was around 60 cycles latency. Might be a tad faster now. Definitely worth caching if you have repeated calculations of it. sin() is done in software these days. The FPU is dog slow. Even sqrt might be faster in software actually. I remember reading about that. Anyway. Can you take a quick value sample of the 250k calls? printf, uniq, sort -n will do. If we have a few dominant cases we should look into it.
Sure. I'll calculate precise numbers now.
Cool. Thanks.
Let's do those kind of function-specific measurements in a separate bug. We know the stub call overhead matters. ~50% of calls in SS are fast natives (~2% on v8) and bz has reduced test cases where a good deal of time is spent figuring out how to call functions.

We could further break this bug into MIC calls versus PIC calls (callprop w/ branded shape), the latter allowing us to bake in a function identity. We can start with MICs and grow from there.

We can also, after initial ICs are done, file follow-up bugs on common functions where it makes sense to inline them (like charAt and charCodeAt, and things in Math).
Blocks: 578916
Counts of Math and String fast natives are as follows:

On v8-v4:
 111870 str_replace
  41345 math_sqrt
  25467 str_charCodeAt
  18192 math_max
  13753 math_pow
   8208 math_random
   3861 str_split
   2051 str_charAt
   1687 math_floor
    615 str_match
     68 math_min
      3 str_toStringstr_charAt
      1 str_toStringmath_sqrt

On SunSpider-0.9.1:
 248370 math_sin
 203259 str_charCodeAt
  63488 math_pow
  48192 math_random
  47940 math_floor
  47349 math_sqrt
  32356 math_cos
  24093 str_charAt
  18000 str_concat
   3760 math_abs
   2713 str_replace
   2500 str_toLowerCase
   2500 math_log
   1852 math_round
   1009 str_split
     56 str_slice
      9 str_match
      4 math_max

sqrt() counts are reported in bug 578916.
Blocks: 579574
Assignee: general → dvander
Assignee: dvander → bhackett1024
Attached patch WIP MIC patch (obsolete) — Splinter Review
Patch for MICs on native calls.  Saves 23ms for me on SS, still failing a few tests.

Questions about GetScriptedCaller.  Some natives use this interface, and the pc and sp need to be stored to the JSStackFrame before this call.  This patch only does this for eval (only one I see in SS using this functionality); doing it for all MIC'ed native calls instead costs 4ms.  Is there a principled way of figuring out whether a native might inspect its scripted caller?  Could this be a new bit in JSFunction.flags?
(In reply to comment #9)
The pc and sp must be valid in C++ land in order for exception handling or GC to work.
(In reply to comment #10)
> (In reply to comment #9)
> The pc and sp must be valid in C++ land in order for exception handling or GC
> to work.

With the consolidated stack and zeroed new stack page allocation, couldn't we mark whole operand frames from the GC?

Even if not, we can reconstruct the stack depth from the pc. See jsopcode.cpp, the ReconstructStackDepth helper.

The bytecode pc could be reconstructed from the machine pc, too. Has anyone dug into that yet?

/be
wrt pc, that's the plan - not sure whether we'll get it in time for 4.0 though.

> With the consolidated stack and zeroed new stack page allocation, couldn't we
> mark whole operand frames from the GC?

pushing new frames overwrites whatever's immediately after sp. in a fast native this would mean reentering Execute or Invoke would blow away your vp.

we could push new frames after script->nslots instead though (wasting a little space and an extra computation).
(In reply to comment #12)
> wrt pc, that's the plan - not sure whether we'll get it in time for 4.0 though.

bhackett can do it, he's fast! ;-)

> > With the consolidated stack and zeroed new stack page allocation, couldn't we
> > mark whole operand frames from the GC?
> 
> pushing new frames overwrites whatever's immediately after sp.

Wait, in comment 11 you mentioned only exception handling and GC. Of course a call might need to store sp, but not every opcode would call -- or so I hope. Does this make sense?

/be
Attached patch revised patch (obsolete) — Splinter Review
This patch always writes out pc/sp to the frame regs, avoids touching any of the JSC assembler headers, and puts the MIC'ed call code in a fresh side buffer using a new NativeCallCompiler interface (this will make it simple to generate inline code for certain natives).
Attachment #462176 - Attachment is obsolete: true
Attached patch working patch (obsolete) — Splinter Review
This adds handling for throws and a couple other fixes.  Passes jstests and trace-tests.
Attachment #462287 - Attachment is obsolete: true
Attachment #462302 - Flags: review?(dvander)
This patch only generates a call path the first time any native is hit at a call site (rather than every time the MIC misses), better handling for polymorphic call sites that can ping-pong between several natives.
Attachment #462302 - Attachment is obsolete: true
Attachment #462302 - Flags: review?(dvander)
Attachment #462393 - Flags: review?(dvander)
Comment on attachment 462393 [details] [diff] [review]
patch handling polymorphic case better

>+            script->mics[i].pcOffset = mics[i].pcOffset;

Shouldn't need this, more on it later.

>+    /* Patch start with a jump to the call buffer. */
>+    start[0] = 0xE9; /* JSC::X86Assembler::OP_JMP_rel32; */
>+    *reinterpret_cast<int*>(start + 1) = (int) result - (int) start - 5;

This really wants some sort of abstraction in the MacroAssembler. Can you use CodeLocationJump and patch with that? See PolyIC for examples.

>+    /* Manually construct the X86 stack. TODO: get a more portable way of doing this. */

#ifndef and error here - okay to break x64 temporarily, we can hack up a quick call abstraction later.

>+    /* Store the PC. */
>+    jsbytecode *pc = script->code + mic.pcOffset;
>+    ncc.masm.storePtr(ImmPtr(pc),
>+                      FrameAddress(offsetof(VMFrame, regs) + offsetof(JSFrameRegs, pc)));

pc is already stored out, stubCall() does this for you.

>+    /* Store sp. */
>+    uint32 spOffset = sizeof(JSStackFrame) + (mic.frameDepth + mic.argc + 2) * sizeof(jsval);
>+    ncc.masm.addPtr(Imm32(spOffset), JSFrameReg, temp);
>+    ncc.masm.storePtr(temp, FrameAddress(offsetof(VMFrame, regs) + offsetof(JSFrameRegs, sp)));

Same. Sorry if I was misleading earlier - you should be able to remove this stuff, just not the one that happens in stubcc.call()/stubCall().

>+    /* Check if the call is throwing, and jump to the throwpoline. */
>+    Jump hasException = ncc.masm.branch32(Assembler::Equal, JSC::X86Registers::eax, Imm32(0));

Better as: branchTest32(Assembler::Zero, Registers::ReturnReg, Registers::ReturnReg)

Looks great so far, looking forward to next patch - will review ASAP.
Attachment #462393 - Flags: review?(dvander)
Brian corrected me IRL - we do need the pc/sp updates because we're patching the jump before the slow call, not the slow call itself.
This patch guards native call MICs on JS_CPU_X86 being defined (on other targets native calls won't be MIC'ed), and abstracts jump instruction generation into BaseAssembler::insertJump.  I don't see a way to use the repatch/link buffers for taking care of the call path jump, as it's inserting a new instruction entirely and the 0xE9 needs to come from somewhere.
Attachment #462393 - Attachment is obsolete: true
Attachment #462544 - Flags: review?(dvander)
Comment on attachment 462544 [details] [diff] [review]
patch fixing comments


>+#ifdef JS_CPU_X86
>+    static void insertJump(uint8 *source, const uint8 *target) {
>+        source[0] = 0xE9; /* JSC::X86Assembler::OP_JMP_rel32; */
>+        *reinterpret_cast<int*>(source + 1) = (int) target - (int) source - 5;
>+    }
>+#endif

Oh, fine :D hopefully this will be short lived - even better if you can lower this any further, like into X86Assembler. There are existing places, abstracted by Apple, where we change instructions. PICs patch LEA to MOV and back again.

(Also - are the page protections being flipped to modify "start"? If not and it's working out-of-box, makes me worry that we're RWX or DEP isn't kicking in.)

>+    /* Store the PC. */
>+    jsbytecode *pc = script->code + mic.pcOffset;
>+    ncc.masm.storePtr(ImmPtr(pc),
>+                      FrameAddress(offsetof(VMFrame, regs) + offsetof(JSFrameRegs, pc)));

You *can* get rid of mic.pcOffset use here though, it should be safe to bake in ImmPtr(cx->regs->pc)

r=me w/ last nit picked, no follow-up patch needed
Attachment #462544 - Flags: review?(dvander) → review+
Status: NEW → RESOLVED
Closed: 11 years ago
Resolution: --- → FIXED
Summary: JM: ICs for function calls → JM: MICs for fastnative calls
Was this pushed? It is marked fixed but I don't see any http://hg.mozilla.org/projects/jaegermonkey/rev/...
You need to log in before you can comment on or make changes to this bug.