Closed Bug 593195 Opened 14 years ago Closed 14 years ago

TM: Blacklist based on iteration count/weight

Categories

(Core :: JavaScript Engine, defect)

defect
Not set
normal

Tracking

()

RESOLVED FIXED

People

(Reporter: dmandelin, Unassigned)

References

Details

(Whiteboard: fixed-in-tracemonkey)

Attachments

(1 file, 1 obsolete file)

* Background

In order to get good perf on certain workloads, we need to avoid running loops on trace that run for very few iterations. The cost of entering and leaving trace outweighs the benefit. In particular, there are 5 loops that we currently blacklist in earley-boyer; if we don't blacklist them, we run 2x slower. 2 of these loops are over arguments. The other 3 walk Scheme linked lists.

The current solution to this is to blacklist a tree if we enter that tree 300 times. This saves earley-boyer, but it hurts a lot on Dromaeo. In particular, Dromaeo runs 3bit inside a timing loop. The workload loop is fine to run on trace: we run it 5-10x faster that way, which is a big deal when scores are aggregated with geometric mean (bleh!!). But the workload loop is called 6000 times (when we don't blacklist), so it gets blacklisted, and our score goes way down.

It appears that what really makes a loop not worth running on trace is if it runs too quickly. One idea that has been proposed in the past is to measure the amount of time spent on trace, and blacklist if that is too small. That proved somewhat hard to get working. It also adds nondeterminism, which is terrible for debugging.

* Proposed Solution

We can instrument each trace with an iteration counter. The counter is initialized to 0 before entering the trace. When the counter reaches a threshold value |high| (e.g., high=300), we strip out the instrumentation immediately.

When we take a loop exit, we make a decision based on the counter value N and a threshold value |low|, (e.g., low=10):

  if N < low, blacklist the loop
  otherwise,  remove the instrumentation

All the usual variations can be tried, such as:

- Instead of just using the count, use a weight value W = N * L, where L is the number of JSOPs in the trace, or the length of the native code in the trace.

- Average multiple counter results. Apparently blacklisting after 300 entries does the trick on earley-boyer, so we could, for example, do a moving average and decide after several entries.

* Implementation Details

The other important bit is to make the instrumentation cheap when we are not using it. We can do it like this:

1. After we generate a trace T1, generate a second tiny trace Tc that just increments the counter and calls a strip-instrumentation function if counter > |high|. Note that so far, this trace is a loop.
2. Patch the jump at the end of T1 to Tc.
3. Patch the jump at the end of Tc to the loop header of T1.

To strip the instrumentation, we simply patch the jump at the end of T1 to the loop header of T1 again.

There seems to be one problem remaining: how to make sure Tc doesn't clobber the register holding 'state'.
(In reply to comment #0)
> 
> * Implementation Details
> 
> The other important bit is to make the instrumentation cheap when we are not
> using it. We can do it like this:
> 
> 1. After we generate a trace T1, generate a second tiny trace Tc that just
> increments the counter and calls a strip-instrumentation function if counter >
> |high|. Note that so far, this trace is a loop.
> 2. Patch the jump at the end of T1 to Tc.
> 3. Patch the jump at the end of Tc to the loop header of T1.
> 
> To strip the instrumentation, we simply patch the jump at the end of T1 to the
> loop header of T1 again.
> 
> There seems to be one problem remaining: how to make sure Tc doesn't clobber
> the register holding 'state'.

That sounds complicated.  For prototyping purposes I'd just put this in at the top of the loop:

  if (count < low) {
      count++;
  }  

On loops that run more than 'low' times you'll end up having an extra (failing) test for every iteration after the 'low'th one, but I suspect that won't be so bad.  Every loop already has a callbackFlag check and that doesn't cost much (ie. I've tried removing it and the effect was negligible).

Then if you find that this approach works well, you could try to jump through the hoops to avoid the instrumentation cost :)
(In reply to comment #1)
> That sounds complicated.  For prototyping purposes I'd just put this in at the
> top of the loop:
> 
>   if (count < low) {
>       count++;
>   }  

Yay!! It looks like that version has no cost, at least in shell SunSpider. ++njn for sure on making this idea a lot easier to hack up.
Attached patch WIP (obsolete) — Splinter Review
This does the basic strategy, and makes earley-boyer work OK, but it regresses SunSpider with -m -j by about 20 ms. I'm not sure why--it just makes most benchmarks 5% slower. I'm thinking it's probably too aggressive about blacklisting. I'll just have to play with different parameters and blacklisting patterns to see what happens.
Could we end up running into issues with this setup if a short-running traceable loop is inside the body of a long-running traceable loop?  Or will the blacklist keep us from entering the short-running loop from JM but allow it to work from interp/tracing?
You're increasing the counter up to 300, but there's no need once you hit 5 since that's the value you're checking for on exit.  And those two values shouldn't be magic numbers :)
I don't think the counter load needs to be marked as volatile.

I measured with Cachegrind, AFAICT the SS slowdown is entirely due to the instrumentation.  The worst case is bitwise-and which executes 1.08x more instructions.

And I only saw a 1% instruction count improvement for earley-boyer.
(In reply to comment #5)
> And those two values shouldn't be magic numbers :)

If you counted down from 5 to 0 you wouldn't need to have the numbers in two places -- the loop exit would just test for zero.
(In reply to comment #5)
> You're increasing the counter up to 300, but there's no need once you hit 5
> since that's the value you're checking for on exit.  And those two values
> shouldn't be magic numbers :)

Hey, that's why it's a WIP. But thanks for your help on this. I think you are right about the volatile part.

I'm not sure about the slowdown: I get pretty much the same time score when I just add the instrumentation. It only seems to slow down when I use it to blacklist. Also, if I use a threshold of 2 (blacklist only if count < 2), I seem to get the same SS score. It's confusing, though, because there are at least 3 variables in play: instrumentation, old blacklisting algorithm, and new blacklisting algorithm. 

Anyway, I think the WIP is pretty close to "working", so I will clean it up and post for review once it is performing correctly. I think it's possible to do better with more experimentation, but this bug blocks landing to m-c.
fwiw, with this wip applied and the title patch applied I get about 3900runs/s on 3bits-in-byte on t-m.  I get 4600runs/s on m-c.
Attached patch PatchSplinter Review
OK, this patch fixes the worst problems on Dromaeo. With this patch, we regress only 2% relative to beta4. It doesn't hurt shell V8 at all. It regresses shell SS by 5 ms or so. 

I'm sure I can improve it a bit more, but it would be nice to get us landable to m-c. Not sure what the requirement is there.
Attachment #471732 - Attachment is obsolete: true
Comment on attachment 471973 [details] [diff] [review]
Patch

OK, I think this is about good enough to make us landable, but I will keep working to improve it. Suggestions welcome.

I checked browser V8: this patch gave a 70-point improvement, but that may be just noise. At any rate, it is as good as tip now, which is an almost 2x improvement over beta 4.
Attachment #471973 - Flags: review?(dvander)
Comment on attachment 471973 [details] [diff] [review]
Patch

>+    //printf("%s:%u: %d\n", f->treeFileName, f->treeLineNumber, JS_THREAD_DATA(cx)->iterationCounter);

this, and
 
>+        //printf("blacklist %s:%u %d\n", f->treeFileName, f->treeLineNumber, JS_THREAD_DATA(cx)->iterationCounter);

this, looks like old debug code.
Attachment #471973 - Flags: review?(dvander) → review+
http://hg.mozilla.org/tracemonkey/rev/5d10c2119565

Marking this bug fixed--I'll file more bugs if I come up with new patches worth landing.
Whiteboard: fixed-in-tracemonkey
Depends on: 593497
> With this patch, we regress only 2% relative to beta4.

On Dromaeo sunspider total?  Or on 3bits in particular?
(In reply to comment #14)
> > With this patch, we regress only 2% relative to beta4.
> 
> On Dromaeo sunspider total?  Or on 3bits in particular?

On Dromaeo total. Bug 593495 has more about why. fannkuch and nsieve-bits take a big hit due to the fall-into-interp problem. If we fixed that, we'd probably be improving things instead.
Aha, thanks!

With this and the title thing fixed, how do things look on the DOM dromaeo tests that had regressed?
http://hg.mozilla.org/mozilla-central/rev/5d10c2119565
Status: NEW → RESOLVED
Closed: 14 years ago
Resolution: --- → FIXED
You need to log in before you can comment on or make changes to this bug.

Attachment

General

Created:
Updated:
Size: