TraceMonkey performance drops if the function contains many variables, math operations, and array item accesses

Status: RESOLVED WONTFIX
Type: defect
Reporter: yury
Assignee: dvander
Keywords: perf
Firefox Tracking Flags: blocking2.0 betaN+
Attachments: 1 attachment (6.38 KB, application/javascript)
User-Agent:       Mozilla/5.0 (X11; Linux i686; rv:2.0b5pre) Gecko/20100827 Minefield/4.0b5pre
Build Identifier: 

The helper code (a compile function) automatically generates the other function's body: it unrolls the loops, creates JS code, and evals it (an FFT algorithm). The generated code initializes variables from the input array, performs the calculations with those variables, and stores the results in the output array -- no loops.
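The attachment itself is not inlined in this report, but a compile() helper of the kind described might look roughly like the following minimal sketch. The names (compile, input, output) and the per-element arithmetic are illustrative stand-ins, not taken from the attached testcase:

```javascript
// Hypothetical sketch of the described helper: emit straight-line code
// (no loops) that loads inputs into local variables, does some math per
// slot, and writes the results back out -- then eval it via Function.
function compile(n) {
  let src = "";
  // Unroll: one variable per input element instead of a loop.
  for (let i = 0; i < n; i++) {
    src += `var v${i} = input[${i}];\n`;
  }
  for (let i = 0; i < n; i++) {
    // Stand-in for the real FFT butterflies: some arithmetic per slot,
    // with constants baked into the generated source.
    src += `output[${i}] = v${i} * ${Math.cos(i)} + ${i};\n`;
  }
  // The generated body becomes a function of (input, output).
  return new Function("input", "output", src);
}

// The generated function behaves like the original loop would.
const f = compile(4);
const out = new Array(4);
f([1, 2, 3, 4], out);
```

For a real FFT the generated body would be far larger, which is exactly what makes the JITs compile so much code.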

The test results via the JS shell:
"-j" and compile() called => iterations: 100; total dur: 11178ms; per item: 111.78ms
just compile() called => iterations: 100; total dur: 1860ms; per item: 18.6ms
"-j", no compile() => iterations: 100; total dur: 170ms; per item: 1.7ms
no "-j", no compile() => iterations: 100; total dur: 4465ms; per item: 44.65ms

(Attaching the code)
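The attached file's actual harness is not reproduced here, but a minimal timing harness that emits results in the shape shown above could look like this (benchmark is an assumed name, and console.log stands in for the shell's print):

```javascript
// Minimal sketch of a timing harness producing output in the
// "iterations: N; total dur: Xms; per item: Yms" shape seen above.
function benchmark(fn, iterations) {
  const start = Date.now();
  for (let i = 0; i < iterations; i++) fn();
  const total = Date.now() - start;
  console.log(`iterations: ${iterations}; total dur: ${total}ms; ` +
              `per item: ${total / iterations}ms`);
  return total;
}

// Example: time a trivial workload 100 times.
const elapsed = benchmark(() => Math.sqrt(12345), 100);
```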

Reproducible: Always

Steps to Reproduce:
Run the js shell to measure the speed: with and without "-j" option
Actual Results:  
The JS shell executes the code more slowly with the "-j" argument.

Expected Results:  
The JS shell executes the code faster with the "-j" argument.
Posted file: Original test
Sorry, I could not simplify the testcase and still reproduce the problem.
Status: UNCONFIRMED → NEW
Ever confirmed: true
OS: Linux → All
Hardware: x86 → All
Using the attached testcase and the current JM tree (which therefore has tracing integration):

  mozilla% ./js/src/opt-obj/js ~/foo.js
  iterations: 100; total dur: 385ms; per item: 3.85ms

  mozilla% ./js/src/opt-obj/js -j ~/foo.js
  iterations: 100; total dur: 4204ms; per item: 42.04ms

  mozilla% ./js/src/opt-obj/js -m ~/foo.js
  iterations: 100; total dur: 16851ms; per item: 168.51ms

  mozilla% ./js/src/opt-obj/js -j -m ~/foo.js
  iterations: 100; total dur: 21266ms; per item: 212.66ms

With the compile() call taken out:

  mozilla% ./js/src/opt-obj/js ~/foo.js
  iterations: 100; total dur: 955ms; per item: 9.55ms

  mozilla% ./js/src/opt-obj/js -j ~/foo.js
  iterations: 100; total dur: 607ms; per item: 6.07ms

  mozilla% ./js/src/opt-obj/js -m ~/foo.js
  iterations: 100; total dur: 613ms; per item: 6.13ms

  mozilla% ./js/src/opt-obj/js -j -m ~/foo.js
  iterations: 100; total dur: 393ms; per item: 3.93ms
When the "compile" call is there, the fft.forward function is a LAMBDA_NULL_CLOSURE.  Same without the "compile" call.  However, the function is a _lot_ bigger with the compile() call.

I tried getting jitstats with the compile() call, but:

  mozilla% env TMFLAGS=stats ./js/src/debug-obj/js -j ~/foo.js
  Assertion failure: base == lirbuf->sp || base == eos_ins,
                     at ../jstracer.cpp:3797
  Bus error

on both JM and TM trees.
blocking2.0: --- → ?
I filed bug 591788 on the assert.
And it's not an issue with the eval; if I just take the generated source and set the function to that in the original script it's still slow.

Given comment 0, part of the issue is that you're explicitly undoing optimizations that at least the tracer does.  For example, by unrolling a loop you're just making the tracer compile a _lot_ more code to get a pretty similar result (before, only one copy of the loop needed to be compiled, but now it's however many you generated).  I would expect something similar to be the case for JM.

In particular, when I profile, for JM 94% of the time is spent under js::mjit::Compiler::Compile, almost all of it under generateMethod.  A lot of that time is spent compiling the code that syncs state (26% in, not under, ImmutableSync::syncNormal; 20% in ImmutableSync::sync; 20% in FrameState::sync; 17% in FrameState::merge; 4% in FrameState::uncopy).

For TM, 87% of the time is spent under record_JSOP_GETELEM, so again compilation (in this case, of gets from arrays).  Almost all this time is spent under snapshot(); some in VisitFrameSlots (most of it, in fact), and some in checkForGlobalObjectReallocation.  Hard to tell where the latter is being called from, since so much stuff is inlined here, but I'd suspect the getFromTrackerImpl and/or knownImpl callsites.  Again, this is all code designed to sync state between the jit and the interpreter, afaik.

In the short run, the right answer seems to be to not do this compile() thing.  I would be somewhat surprised if it helps in any engine with a JIT, though one that never has to sync back to an interpreter might not be too badly off.

In the longer run, can we detect this sort of thing and abort compiling it?

It _is_ sort of impressive that hand-compiling the JS to JS produces a result as fast under interp as -j -m on the original source, though.  ;)

Oh, I tried bumping the iteration count from 100 to 1000.  While this doesn't affect the per-item time for the interpreter on the compiled fft or the -m time on the uncompiled one (which is slower than interp on compiled fft on this testcase, per above), the -j time on uncompiled goes up to 8ms per item, and the combined -j -m on uncompiled goes up to 6ms/item.  So it's actually a win to compile + interp....  We should _really_ be able to beat that with jm on the uncompiled fft, I'd think!
Not good. JM is getting some worst-case behaviour in the sync algorithm, which is O(n) where n is the number of "tracked" slots. We normally expect that "tracked" is significantly lower than nslots. Here that's not true, and the code is very linear - no branches giving us the opportunity to clear the tracker.

Syncing needed to be O(slots) until we started relying on conservative GC. All it needs to be now is O(k + m) where k=registers and m=stack uses of current op.

There's a second sync algorithm that kicks in during extreme register pressure. I don't see it showing up in Shark though - I'll fix the main one and see what that gets us.
That's for JM on the attached testcase, right?  It may also be worth looking into what happens if you comment the compile() line in the testcase out with JM...
Yes. With the patch queue in bug 591836, on my MBP:

w/ Compile:

./Opt32/js -m k.js
iterations: 100; total dur: 1088ms; per item: 10.88ms

./Opt32/js k.js
iterations: 100; total dur: 434ms; per item: 4.34ms

~/projects/v8/shell k.js
iterations: 100; total dur: 302ms; per item: 3.02ms

~/projects/WebKit/WebKitBuild/Release/jsc k.js
iterations: 100; total dur: 684ms; per item: 6.84ms

w/out Compile:

./Opt32/js -m k.js
iterations: 100; total dur: 128ms; per item: 1.28ms

./Opt32/js k.js
iterations: 100; total dur: 577ms; per item: 5.77ms

~/projects/v8/shell k.js
iterations: 100; total dur: 237ms; per item: 2.37ms

~/projects/WebKit/WebKitBuild/Release/jsc k.js
iterations: 100; total dur: 101ms; per item: 1.01ms
Also worth noting: in JM, 13% of the time goes to FrameState::uncopy(); the rest is fairly evenly distributed between the parser and the method JIT compiler.
(In reply to comment #5)
> 
> For TM, 87% of the time is spent under record_JSOP_GETELEM, so again
> compilation (in this case, of gets from arrays).  Almost all this time is spent
> under snapshot(); some in VisitFrameSlots (most of it, in fact), and some in
> checkForGlobalObjectReallocation.

The patch in bug 592177 gives me a 1.11x reduction in instruction counts and about a 1.08x reduction in time.
(In reply to comment #5)
> 
> For TM, 87% of the time is spent under record_JSOP_GETELEM, so again
> compilation (in this case, of gets from arrays).  Almost all this time is spent
> under snapshot(); some in VisitFrameSlots (most of it, in fact),

Yes, VisitFrameSlots() is hot for other compile-heavy benchmarks like 3d-raytrace and crypto-md5.  CountSlotsVisitor and DetermineTypesVisitor are the two visitor classes that account for probably 90%+ of the VisitFrameSlots() calls;  of the two, DetermineTypesVisitor is a bit more important, maybe by a ratio of 60:40 or so.
(In reply to comment #11)
> 
> Yes, VisitFrameSlots() is hot for other compile-heavy benchmarks like
> 3d-raytrace and crypto-md5.  CountSlotsVisitor and DetermineTypesVisitor are
> the two visitor classes that account for probably 90%+ of the VisitFrameSlots()
> calls;  of the two, DetermineTypesVisitor is a bit more important, maybe by a
> ratio of 60:40 or so.

That's true for the SunSpider ones;  for the benchmark in this bug, DetermineTypesVisitor accounts for almost all of the VisitFrameSlots calls.
AFAICT the vast majority of DetermineTypesVisitor calls are due to TraceRecorder::snapshot().  In particular, every time we call snapshot() we recompute the entire typemap, even though in practice the typemap will be very similar to the previous typemap.  Surely we can take advantage of this fact somehow.
Seems like it.  File a bug blocking this one?
Michael Bebenita has researched various ways of compressing snapshots in tracing JITs. Cc'ing him.

/be
Andreas, Michael: I recall a bug already on file to use Michael's research results to make snapshotting more efficient. Can you find the bug number?

/be
Having to walk 4,300 local slots every snapshot() to determine the typemap is sucky (academically, "total clownshoes"). We could keep track of the typemap incrementally as we record - which sounds tricky and error-prone given the current TR design. Easier might be caching the last known typemap and watermarking the -/+ positions on the stack, to only recompute a slice at the end.

We'd have to be careful of opcodes like SETNAME or SETELEM which can change random stack slots.

My guess is that test cases like this should be under the "don't bother tracing" category right now. If nfixed > XYZ, don't even record.
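The watermarking idea above can be sketched in JS for illustration (the real recorder is C++ in jstracer.cpp; makeRecorder, touch, and snapshot are invented names): cache the last typemap, track the low/high slot indices written since the previous snapshot, and recompute types only for that dirty slice.

```javascript
// Illustrative-only sketch of slice-based typemap recomputation.
function makeRecorder(nslots, typeOf) {
  const typemap = new Array(nslots).fill(null);
  let lo = 0, hi = nslots - 1;              // everything is dirty at first

  return {
    // Call when an opcode writes slot i. (Opcodes like SETNAME or SETELEM,
    // which can touch arbitrary slots, would need extra care, per above.)
    touch(i) {
      if (lo === -1) { lo = i; hi = i; }
      else { lo = Math.min(lo, i); hi = Math.max(hi, i); }
    },
    // snapshot(): rescan only [lo, hi] instead of all nslots slots.
    snapshot(slots) {
      if (lo !== -1) {
        for (let i = lo; i <= hi; i++) typemap[i] = typeOf(slots[i]);
        lo = -1; hi = -1;                   // the cache is clean again
      }
      return typemap.slice();
    },
  };
}

// With 4,300 slots, a snapshot after one write rescans 1 slot, not 4,300.
const rec = makeRecorder(4, v => typeof v);
const slots = [1, "a", true, 2];
rec.snapshot(slots);                        // first snapshot scans everything
slots[2] = 3;
rec.touch(2);
const tm = rec.snapshot(slots);             // rescans only slot 2
```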
(In reply to comment #17)
>
> Easier might be caching the last known typemap and
> watermarking the -/+ positions on the stack, to only recompute a slice at the
> end.

We're actually duplicating snapshots reasonably often, eg. 20--40% of them.  ie. we do snapshot(BRANCH_EXIT) and then repeat it almost immediately afterwards.  I'm partway through a patch to fix this.
blocking2.0: ? → betaN+
Assignee: general → dvander
Depends on: 594591
(In reply to comment #18)
> (In reply to comment #17)
> >
> > Easier might be caching the last known typemap and
> > watermarking the -/+ positions on the stack, to only recompute a slice at the
> > end.
> 
> We're actually duplicating snapshots reasonably often, eg. 20--40% of them. 
> ie. we do snapshot(BRANCH_EXIT) and then repeat it almost immediately
> afterwards.  I'm partway through a patch to fix this.

Nick, could you file, and make the new bug block this bug? This bug is meta and what you describe sounds great -- could be an easy win for a number of benchmarks.

/be
(In reply to comment #19)
> > 
> > We're actually duplicating snapshots reasonably often, eg. 20--40% of them. 
> > ie. we do snapshot(BRANCH_EXIT) and then repeat it almost immediately
> > afterwards.  I'm partway through a patch to fix this.
> 
> Nick, could you file, and make the new bug block this bug? This bug is meta and
> what you describe sounds great -- could be an easy win for a number of
> benchmarks.

I could, but my patch turned out to have some bogus assumptions in it and avoiding the duplicates is harder than it first seems.  I filed and fixed bug 594640 which caught a common case that was easy to fix.  In other cases the duplicates are created in separate functions in jstracer.cpp with non-trivial control flow, eg. sometimes the duplicate is created and sometimes not.  So I don't have much to say if I were to open a new bug.
Also, my 20--40% number was based on those bogus assumptions; the number of genuine duplicates is probably lower.  It's not clear to me exactly how to measure it; suggestions are welcome.
Depends on: 601109
No longer depends on: 601109
Bug 594591 has been opened to abort recording in the tracer if there are too many slots.  Comment 8 indicates that the results for this program are reasonable in the methodjit.  And my ideas about optimizing snapshots have petered out after bug 594640.

I don't think anything else concrete will come out of this bug.  Please reopen (with explanation) if you disagree.
Status: NEW → RESOLVED
Closed: 9 years ago
Resolution: --- → WONTFIX