Closed Bug 631706 Opened 9 years ago Closed 9 years ago

methodjit loopless functions on the second call only

Categories

(Core :: JavaScript Engine, defect)

defect
Not set

Tracking

()

RESOLVED DUPLICATE of bug 631951

People

(Reporter: dmandelin, Unassigned)

References

Details

See 631581 for measurements suggesting this is a great idea. (Also measurements by Luke that aren't in a bug.) It is expected to cut methodjit memory usage about 2x, and also make web sites a hair faster.
The important question right now is, do we want to try this for Fx4.

Main benefits:

+ Cuts jitcode memory usage by *roughly* estimated 2x in browser
+ Cuts total browser memory usage by *roughly* estimated 3-10%.

Main risks:

- May regress SunSpider performance. Prima facie, this seems unlikely, as not
  compiling things that run only once should just win. But there could be
  interaction effects: e.g., does it mess up trace tuning?
- May introduce bugs. This also seems unlikely, as I believe it would just be
  adding a hitcount to the call ICs.
- Fixing this bug is obviated by the "compile methods only after N calls" bug
  I am about to file. That bug is harder, but if we did it, this one might be
  wasted effort.
I really like this idea. Should be easy 4.0-fodder to mark loop existence while parsing and then add a 2-bit hit counter for interpreter hotness.

I suspect that once we've decided to method JIT because something is hot, we should keep method JIT'ing calls from in that VMFrame. As long as we don't fiddle with that, the potential for regression is really low.
If we're doing something, seems like we might as go all the way with bug 631714.  This would be a lot better than nothing, though.
The simplest thing that could possibly work is often best, and while bug 631714 would be more complete, a patch for this bug would be done sooner, letting us measure to analyze perf effects and memory wins, test for regressions, etc.

/be
(In reply to comment #4)
> The simplest thing that could possibly work is often best, and while bug 631714
> would be more complete, a patch for this bug would be done sooner, letting us
> measure to analyze perf effects and memory wins, test for regressions, etc.

The simplest thing is rarely best when it comes to achieving high performance.

More importantly and more specifically, the difference in complexity between the one-counter approach and the two-counter approach is small.  (The one-counter approach is just the two-counter approach with one of the thresholds set to zero.)  AFAICT, cdleary has already implemented the two-counter approach (http://hg.mozilla.org/users/cleary_mozilla.com/tm-interp-mq/ from bug 631740 is hard to read due to some hg clumsiness) and the patch is quite small.  But the difference in effectiveness between one-counter and two-counter is large -- a 2x memory use reduction vs. a 5--10x reduction, according to dmandelin's calculations in bug 631581.

As for the risks raised by dmandelin in this bug and bug 631714:

- Sunspider doesn't look to be a problem, judging from bug 631740.  We'd have to set the counter threshold to outlandish values to regress it much.  And Sunspider is a good test for speed regressions, because most of the functions should end up being compiled, I think.

- In bug 631581 dmandelin assumed times of t, 4t and 500t for compiled execution, interpreted execution, and compilation.  These seem reasonable to me but are there any cases where they might be drastically wrong?  The worrying case would be something that is *really* slow when interpreted, so much so that interpreting it a few times would hurt.  At first I thought eval might be like that, but I think each eval call becomes its own JSScript, in which case it wouldn't be a problem?  Anything else like that -- maybe some regexp or string operations?

- As for the trace tuning being thrown off... let's assume the thresholds we choose are small (I think 16 for both counters seems reasonable, or maybe even 8).  If a program's performance is hurt greatly by a small number of interpreted executions prior to compilation, it must be very sensitive to small changes anyway, and if we're getting good perf right now it's probably due to luck.  So my inclination would be to check that nothing in SS/V8/Kraken is affected, and not worry too much beyond that.  Having said that, small thresholds (like 16 or 8) seem better than large thresholds because they will cause the least perturbation to trace tuning.

Obviously, lots of measurement ASAP is needed here.  I'm happy to help.
(BTW, this one idea of not compiling everything is currently split across four bugs:  bug 631581 which has some space measurements and analysis, bug 631706 and bug 631714 which propose the one-counter and two-counter approaches respectively, and bug 631740 which has some time measurements and an implementation of the two-counter approach.  I find this four-way split confusing.  Does anyone object if I file a new bug with a summary of findings so far and pointers back to the four existing bugs, and mark the four existing bugs as dups of the new bug?  All subsequent work and discussion would go in the new bug.)
Incidentally, Type Inference wants this anyway: see bug 621077.
(In reply to comment #5)
> (In reply to comment #4)
> > The simplest thing that could possibly work is often best, and while bug 631714
> > would be more complete, a patch for this bug would be done sooner, letting us
> > measure to analyze perf effects and memory wins, test for regressions, etc.
> 
> The simplest thing is rarely best when it comes to achieving high performance.

True, look at all our complex code! One bit seems better than one or two counters, still. What am I missing?

I was commenting here to be specific: this bug, not the more general one, would be the one to fix, if one bit is enough.

It does seem having more than one bug is hurting. Thanks for connecting the bug dots.

/be
(In reply to comment #8)
> One bit seems better than one or two counters, still. What am I missing?

"Better" in what sense?  Space-wise, dmandelin's number show that two counters gives much bigger reductions, eg. 5-10x instead of 2x.

Complexity-wise, I think going from zero counters to one counter (even if it's only 1 bit, it's still a counter) is a much bigger step than going from one to two;  in other words, one counter is only marginally better (simpler) than one.

> I was commenting here to be specific: this bug, not the more general one, would
> be the one to fix, if one bit is enough.

Fair enough.  I'm arguing that one bit isn't enough.

> It does seem having more than one bug is hurting. Thanks for connecting the bug
> dots.

I'll file the new one.
I'd take a 2X memory win over 5-10X if it meant simpler heuristics that don't potentially delay shipping because of correctness or performance regressions.
(In reply to comment #10)
> I'd take a 2X memory win over 5-10X if it meant simpler heuristics that don't
> potentially delay shipping because of correctness or performance regressions.

This was my point in comment 4. I fear I pressed an njn button with simplest-thing-that-might-work meme. Sorry about that, it's all relative to release endgame and timing.

cdleary's work in bug 631740 looks pretty good at a glance, OTOH. David, did you have a look?

/be
(In reply to comment #11)
> cdleary's work in bug 631740 looks pretty good at a glance, OTOH. David, did
> you have a look?

It only touches the interpreter which is good, but backedge counting is invasive this late in the game. Chris said on Friday that on a browser build, gmail didn't load with that patch (or some iteration of it, he was going to retest or debug in Monday).

If we can get the mechanics in and get correctness regressions out, then we can adjust for the performance regressions by just turning the counters down. Worst case, we use the same amount of memory we're using now. But it's the correctness that I worry about more.
Status: NEW → RESOLVED
Closed: 9 years ago
Resolution: --- → DUPLICATE
Duplicate of bug: 631951
(In reply to comment #11)
> (In reply to comment #10)
> > I'd take a 2X memory win over 5-10X if it meant simpler heuristics that don't
> > potentially delay shipping because of correctness or performance regressions.
> 
> This was my point in comment 4. I fear I pressed an njn button with
> simplest-thing-that-might-work meme. Sorry about that, it's all relative to
> release endgame and timing.

It's the whole issue of memory usage that's the njn button :)

A key question:  how much delay is a 5-10x reduction worth vs. a 2x reduction?  dvander seems to be arguing that no delay is worth it, I'm arguing a non-zero delay is worth it.  Obviously this is up for discussion.  Perhaps "attempt two-counter if we can get it into a beta, attempt one-counter if we cannot" is worth considering as a compromise.
We really don't have much time left. b12 is just to help ramp down blockers to rc1, which had better be "no known blockers" or it's b13.

So no fancy knots, please -- simple, reliable, fast knots only.

/be
You need to log in before you can comment on or make changes to this bug.