Closed Bug 542905 Opened 10 years ago Closed 9 years ago

cse chains should be cleared more selectively in case of labels

Categories

(Core Graveyard :: Nanojit, defect)

defect
Not set

Tracking

(Not tracked)

RESOLVED FIXED
Future

People

(Reporter: gal, Assigned: wmaddox)

References

Details

(Whiteboard: PACMAN, fixed-in-nanojit, fixed-in-tracemonkey, fixed-in-tamarin)

Attachments

(4 files, 1 obsolete file)

We currently flush the CSE state in every label because it could be the tail of a control flow diamond that contains definitions we put inside the diamond that goes dead now.

 a
/ \
| b
| |
\ /
 *

In *, a is dead, but b is still alive. Since we don't have an explicit way to just kill that one (now dead) cse-able instruction b, we clear the entire table.

There are several ways how we could fix this. We can explicitly destroy cse-entries as we merge register allocator states. Might be tricky.

A more explicit way would be to communicate scopes to nanojit and give it an understand of the control-flow structure.

In the alternative we could explicitly disable entering values into the cse table while compiling such control flow splits, either in a global way (cse_off()) or in a scoped way. The latter is probably more desirable for tamarin which uses complex control flows, where ours is usually just a sequence of small diamonds.

 |
/ \
\ /
 |
 |
/ \
\ /
 |
Hmm.  The whole diamond support in NJ is really primitive.  That could make this difficult.
Also, CSE is occasionally a big win (I think one of the SunSpider crypto ones goes 2x faster or 4x faster or something with it) but most of the time doesn't help, and it's one of the more expensive parts of NJ compilation.
Could we maintain a stack of CSE tables that represents the dominators of the current block, as code is emitted?  at a branch, push an entry, and at the label of the branch, pop it.  

i'm handwaving a bit, but if this modeling of dominators is accurate, then we win.

for this to be robust, branch patching probably cannot be allowed by mutating a LIns* directly; it should be a call that flows through the LirWriter pipeline so each stage can see what is being patched.
Does anyone have some example code where the current approach causes problems?  One reason I ask is that CSE is one of the more expensive parts of compilation and so we have to be careful about adding new bits to it.
(In reply to comment #3)
> Could we maintain a stack of CSE tables that represents the dominators of the
> current block, as code is emitted?  at a branch, push an entry, and at the
> label of the branch, pop it.

Right; that was my first thought too.

Except it's a bit overly pessimistic.  At the end of the diamond we do
need to pop the topmost table (available-expressions set), but we can
add to the one underneath, any new expressions which get defined by
phi-nodes associated with (iow, immediately following) the diamond.
Depends on: 545270
> In the alternative we could explicitly disable entering values into the cse
> table while compiling such control flow splits, either in a global way
> (cse_off()) or in a scoped way. The latter is probably more desirable for
> tamarin which uses complex control flows, where ours is usually just a sequence
> of small diamonds.

I think this (the scoped version) sounds like it'll provide the best cost/benefit trade-off.

It's nice to talk about diamonds but we only have jumps and labels, which makes things a bit harder.  Here's what I think is about the simplest possible algorithm that works:

- If we hit a label for a backward jump, clear everything.  I think we don't want to CSE any value defined before a loop without also having a 'live' for it after the loop.  This doesn't seem bothering with because currently nothing of interest is computed before we enter a loop?  Maybe if we did loop-hoisting (bug 545406) it might be worthwhile but I don't see that happening for a while.

- If we hit a forward jump, set the 'cse_off' bit, and record the jump in a list.  While 'cse_off' is set we are in a section of code that isn't always executed, so we can perform CSE using already-available expressions already stored, but we can't mark new expressions as available.

- If we hit a label for a forward jump, remove that jump from the jump list.  If the jump list becomes empty as a result, clear the 'cse_off' bit -- we're now back into always-executed code.  In order to actually get the forward jump when we hit the label, we'll need to incorporate LIns::setTarget() somehow into the writer pipeline (like what Ed said in comment 3).

- As for working out load/store aliases, I think there'd be little change -- if a not-always-taken path contains a store that invalidates a load, then we just be conservative and invalidate that load.

This is a fairly minor change to CSE and the writer pipeline, and no change to LIR, but it'll greatly increase the scope for CSE.  It does miss CSE opportunities for values produced and consumed within a diamond.  For TM at least I think that'll hardly matter, because our diamonds are small.  If TR's diamonds are bigger then more opportunities might be missed, but it'll still be much better than the current approach of trashing the entire CSE state.
slight tweak:  I think it's a label list rather than a jump list.  Say you have two jumps to the same label (e.g. from a short-circuit && expression);  once you see the label, remove it from the list, turn cse back on.  intuitively this is Rightious, because once you've seen the last label (no forward branches left to patch), you must be in code that's always executed.

for TR compiling whole methods at a time, we probably get hosed by labels clearing CSE, but we'd probably also get killed by not supporting CSE inside an if statement, e.g.

function foo() {
  if () {
    /* tons of code */
  }
}

I think the way we'd use this is for inlined primitives that expand into small diamond flows, but we'd want to leave CSE working as-is for the large-scale control flow structure of the original program.  hmm...
(In reply to comment #7)
> slight tweak:  I think it's a label list rather than a jump list.

Conceptually they're equivalent.  In practice a label list won't work.  When you switch on 'cse_off' you have the jump at hand, but not the label, because the label doesn't exist yet.

> for TR compiling whole methods at a time, we probably get hosed by labels
> clearing CSE, but we'd probably also get killed by not supporting CSE inside an
> if statement, e.g.
> 
> function foo() {
>   if () {
>     /* tons of code */
>   }
> }

Oh, you're talking about the case where more code is maybe-taken than definitely-taken?  Hmm, that does complicate things.  That points towards the stack-of-Cse-tables idea, but I'm having trouble seeing how to implement that efficiently -- seems like it requires lots of copying.
Attached patch draft patchSplinter Review
I implemented the algorithm from comment 6.  It appears to work, but does remarkably little for TM code.  Scanning through the diffs of the generated code I could only see that lots of 32-bit immediates were being CSEd that weren't previously.  But that's not very interesting, as most 32-bit immediates get folded into other operations anyway.  I couldn't see any cases where anything more complex was CSE'd.  Instruction counts were barely changed, in some cases slightly worse (eg. 0.2%) because CseFilter's hash tables are more stressed than before.

Given this result, I'm disinclined to work on this any more, at least until something changes.
Oh, a downside of the patch is that names given to immediates with addName() become less-than-useless.  Eg. you give '2' the name JSVAL_DOUBLE and then every use of 2 gets that name even when it's not appropriate.  This already happens a bit but with the patch it's much worse.
Blocks: 563944
Whiteboard: PACMAN
Target Milestone: --- → Future
This patch takes the simplest possible approach to determining when to resume updating the CSE tables -- let the client (front-end) worry about it.  This is trivial for the intended use cases involving synthetic diamonds generated by the front-end, and avoids the difficulty noted in comment #7.
Attachment #471365 - Flags: review?(edwsmith)
Attachment #471365 - Flags: feedback?(nnethercote)
Attachment #471365 - Attachment is patch: true
I don't see what this gains -- does it enable CSE in the case where you have

  if (...) {
    tons of code
  }

?
I believe gal originally filed this to make typed array access faster on trace, but I don't remember the details at this point.
(In reply to comment #13)
> I believe gal originally filed this to make typed array access faster on trace,
> but I don't remember the details at this point.

Sure.  My question was about the constrasting approaches in Bill's patch and my patch, not the utility of the bug in general.
(In reply to comment #12)
> I don't see what this gains -- does it enable CSE in the case where you have
> 
>   if (...) {
>     tons of code
>   }
> 
> ?

CSE is only disabled from the suspendCSE() to the resumeCSE().  It is intended
that this would be done only for synthetic control flow diamonds resulting,
say, from inline expansion of an ABC opcode.  For example:

            // * -> Number
#ifdef VMCFG_FASTPATH_FROMATOM
            CodegenLabel not_intptr;
            CodegenLabel not_double;
            CodegenLabel done(LTy_D);
            suspendCSE();
            prepareLabelForMerge(done);
            LIns* val = loadAtomRep(index);
            LIns* tag = andp(val, AtomConstants::kAtomTypeMask);
            // kIntptrType
            branchToLabel(LIR_jf, eqp(tag, AtomConstants::kIntptrType),
not_intptr);
            LIns* v1 = p2dIns(rshp(val, AtomConstants::kAtomTypeSize));
            jumpToLabelWithValue(done, v1);
            emitLabel(not_intptr);
            // kDoubleType
            branchToLabel(LIR_jf, eqp(tag, AtomConstants::kDoubleType),
not_double);
            LIns* v2 = ldd(subp(val, AtomConstants::kDoubleType), 0,
ACCSET_OTHER);
            jumpToLabelWithValue(done, v2);
            emitLabel(not_double);
            LIns* v3 = callIns(FUNCTIONID(number), 1, val);
            fallToLabelWithValue(done, v3);
            LIns* res = emitLabelWithMerge(done);
            resumeCSE();
            return res;

Normally, you wouldn't generate "tons of code" in this case, and would leave
CSE enabled otherwise.  Of course, that means CSE state would be cleared at
labels outside of synthetic diamonds, but that's what happens already.  The
objective here is to hide the effects of well-behaved single-entry/single-exit
fragments with internal control flow so as not to inhibit optimization of the
surrounding code.
Nick: The problem with disabling CSE, as an alternative to a full-blown flow analysis with correct treatment of merge points, is that there is a tradeoff between losing CSE opportunities within the region in which CSE is turned off (because new expressions are not recorded) vs. losing opportunities following that region due to conservative clearing at labels.  Rather than attempting to infer the likely most profitable choice based, say, on the amount of code within the diamond, I simply leave it up to the client.  For the simple use cases envisioned, this sidesteps the issue of pessimizing arbitrary diamonds that may contain "tons of code".
(In reply to comment #16)
> For the simple use
> cases envisioned, this sidesteps the issue of pessimizing arbitrary diamonds
> that may contain "tons of code".

For simple diamonds is the behaviour of your patch (if used by the client in the way you suggest) any different to the behaviour of my patch (which doesn't require the client to do anything)?
Yes.  Your code does not distinguish small diamonds, such as might arise from inline expansion of an ABC bytecode or JS operator (e.g., think inlined fastpath for adding a pair of tagged integers), from larger diamonds that may arise from control flow within the script itself.

I suspect that in TM, the only diamonds are synthetic, and small enough to push the tradeoff in favor of suspending CSE.  Straight-line control flow is the norm
within a trace.

In TR, diamonds presently arise only from control flow within the script, and may include large amounts of code that would benefit from CSE.  There is no particular bias toward straight-line control flow, and supressing CSE within a basic block simply because the block is nested within a conditional could very easily negate the value of CSE entirely -- consider a conditional that wraps an entire method body.  At present, we reset CSE state at labels, which is safe, but means that expanding an ABC instruction to a code sequence involving a harmless branchover would end the current block and reset the CSE state.
Going forward, we want to be able to insert small branchovers in a way that does not disrupt CSE of the surrounding code.
Comment on attachment 471365 [details] [diff] [review]
Simple patch to suspend/resume CSE around synthetic control-flow diamonds

R+ because no bugs found, but we should resolve any concerns about the CSEFilter API and usage before landing.

One possible alternative for TR (methods with complex control flow, plus small inlined diamonds), would be to modify our frontend to skip CSEFilter while emitting inlined diamond-shaped code.  This is easy to do by mutating the filter chain on the fly, for example:

   LirWriter before_cse = /* before_cse -> cse_filter */;
   before_cse->out = cse->out;
   /* emit diamond */
   before_cse->out = cse;

This is less good than Bill's patch because code in the diamond cannot
match any expressions occuring before the diamond.  Most likely, inlined
code would only match constants.  But, the above code would not require
any changes to class CSEFilter.

To me this is a bit hacky but maybe it's the lesser of two evils?
Attachment #471365 - Flags: review?(edwsmith) → review+
(In reply to comment #19)
>
> To me this is a bit hacky but maybe it's the lesser of two evils?

Sounds good to me, as then things would just work out for TM :)
(In reply to comment #20)
> (In reply to comment #19)
> >
> > To me this is a bit hacky but maybe it's the lesser of two evils?
> 
> Sounds good to me, as then things would just work out for TM :)

Or for a little extra cost, but less hacky (in some eye), you could use a Tee filter, with one end by-passing the cse filter.
What's a Tee filter?
A non-existent class :)  We had it at one point, way back, it basically was a pass-through that fed 2 outputs from a single input.

Although, thinking more about it, in this case, we'd want to shut off 1 of the outputs, so its really similar to what Ed proposed above.
It would be possible to have a 'class LirWriterOption : public LirWriter' that wraps another LirWriter instance and allows it to be dynamically switched in or out of the pipeline.  I gather this was Rick was getting at with his comment above.  It would be more expensive than many of our pipeline stages, however, because it would insert another virtual method call for *every* API method, whereas many of our pipeline stages are selective in what they override.
I'd also like to preserve the pre-diamond CSE state within the diamond, rather than simply bypassing CSE entirely.

As an alternative to passing suspendCSE()/resumeCSE() down the entire pipeline, and thus impacting the LirWriter API, I could make these specific to the CseFilter class.  This would introduce a bit more complexity for the client, but should be workable.  The 'suspended' state variable itself, and the tests made against it during CSE, should be reusable by any similar mechanism that TM may introduce, including one where the diamonds are inferred automatically.
Revised patch does not alter the generic LirWriter interface, but exposes CseFilter::suspend() and CseFilter::resume().
Attachment #471365 - Attachment is obsolete: true
Attachment #476431 - Flags: review?
Attachment #476431 - Flags: feedback?(edwsmith)
Attachment #471365 - Flags: feedback?(nnethercote)
Attachment #476431 - Flags: review? → review?(nnethercote)
Attachment #476432 - Attachment is patch: true
Comment on attachment 476431 [details] [diff] [review]
Simple patch to suspend/resume CSE around synthetic control-flow diamonds (v2)

This bug is going around and around, which is frustrating... but I don't like the new patch.  You have to remember/know to put the suspend/resume calls around smalls diamonds, and it's possible to insert them in the wrong place which will silently cause bad code generation.  Two not-so-nice features.

TM and TR have quite different requirements here.  Can those differences be encapsulated in a boolean flag (compile-time or run-time) in CseFilter somehow?
Attachment #476431 - Flags: review?(nnethercote) → review-
(In reply to comment #27)
> Comment on attachment 476431 [details] [diff] [review]
> Simple patch to suspend/resume CSE around synthetic control-flow diamonds (v2)
> 
> This bug is going around and around, which is frustrating... but I don't like
> the new patch.  You have to remember/know to put the suspend/resume calls
> around smalls diamonds, and it's possible to insert them in the wrong place
> which will silently cause bad code generation.  Two not-so-nice features.

I don't really like it either but the problem is fundamentally that the heuristic that we are using to avoid the full-blown stack of CSE tables requires us to distinguish between small "well-behaved" diamonds and diamonds introduced by general control flow.  There is no simple property of the flowgraph that crisply distinguishes these two cases (though we could classify diamonds based on instruction count alone).

If I understand correctly, the only reason you don't see this as a problem for TM is that TM *only* produces such small diamonds, and thus can assume that any diamond that NJ can infer from the branches and labels should be treated in the same way.  This assumption is not warranted in general, and to make it tacitly embeds assumptions peculiar to TM into the semantics of LIR.

> TM and TR have quite different requirements here.  Can those differences be
> encapsulated in a boolean flag (compile-time or run-time) in CseFilter somehow?

The 'suspended' flag should be simply the logical complement of the "onAlwaysExecutedPath" in patch
https://bug542905.bugzilla.mozilla.org/attachment.cgi?id=443051 .

The existing CSE behavior is enabled by default.  If you do not use the suspend()/resume() API, nothing is changed.  Any other strategy to turn CSE on/off is compatible as long as it ultimately reduces to some policy for toggling the 'suspended' variable.  I simply implemented the simplest possible such policy.  The intent was not to preclude the implementation of other policies.  Given that we seem to be in agreement now that TM and TR have different requirements that are unlikely to be satisfied with a single unified policy, what exactly is the policy that TM would prefer?  I can then implement both policies and a selection mechanism.  Are you happy with what you proposed in your earlier draft, or was that nixed by the results you reported in comment #9 ?
(In reply to comment #28)
> 
> The existing CSE behavior is enabled by default.  If you do not use the
> suspend()/resume() API, nothing is changed.  Any other strategy to turn CSE
> on/off is compatible as long as it ultimately reduces to some policy for
> toggling the 'suspended' variable.  I simply implemented the simplest possible
> such policy.  The intent was not to preclude the implementation of other
> policies.  Given that we seem to be in agreement now that TM and TR have
> different requirements that are unlikely to be satisfied with a single unified
> policy, what exactly is the policy that TM would prefer?  I can then implement
> both policies and a selection mechanism.  Are you happy with what you proposed
> in your earlier draft, or was that nixed by the results you reported in comment
> #9 ?

How about we pass in a flag "autoSuspend" to CseFilter().  When it's true, CseFilter will automatically suspend/resume around diamonds;  this is what TM will use.  When it's false, it won't;  this is what TR will use.  That seems pretty simple and the overhead will be low because it'll only be consulted on branches and labels.  Are you happy to hack that up?

As for comment 9, TM's LIR generation has changed a lot since then so I'll remeasure once we have an updated patch.
Comment on attachment 476431 [details] [diff] [review]
Simple patch to suspend/resume CSE around synthetic control-flow diamonds (v2)

I've had a change of heart!  For a few reasons:

- Doing this automatically is tricky.

- In TM there are a few key diamonds that we want CSE to work around.
  Manually annotating them isn't hard, and I'd like it to work ASAP.

- Correctly inserting the suspend/resume pairs isn't that hard.

- It'll allow me to increase the CSE scope gradually.

Mostly, it's a "this is good enough for now" r+.  We can revisit it (eg.
doing it automatically) later if necessary.


>@@ -2478,17 +2481,18 @@ namespace nanojit
> 
>     LIns* CseFilter::insLoad(LOpcode op, LIns* base, int32_t disp, AccSet accSet, LoadQual loadQual)
>     {
>         LIns* ins;
>         if (isS16(disp)) {
>             if (storesSinceLastLoad != ACCSET_NONE) {
>                 // Clear all normal (excludes CONST and MULTIPLE) loads
>                 // aliased by stores and calls since the last time we were in
>-                // this function.  
>+                // this function.  Clearing aliased loads must be performed
>+                // even when CSE is suspended.

Nit: "Aliased loads must be cleared even when CSE is suspended."


>+        // If true, we will not update the CSE tables, but continue to recognize
>+        // and optimize expressions matching existing table entries.
>+        bool suspended;
>+

Nit: "If true, we will not add new instructions to the CSE tables, but
we will continue to CSE instructions that match existing table entries.
Load instructions will still be removed if aliasing stores are encountered."


>+
>+        // Preserve the current CSE table state until the next call to resume().
>+        // New expressions are not added to the tables, nor are they cleared at labels.
>+        // The intent is that a suspend()/resume() pair may delimit a synthetic
>+        // control flow diamond, preventing the inserted label from having an adverse
>+        // affect on CSE of the surrounding code.
>+        void suspend() { suspended = true; }
>+
>+        // Resume adding expressions to CSE tables.  All subsequent code up to the
>+        // next label must be dominated by the point at which the preceding suspend()
>+        // was invoked, otherwise CseFilter may yield incorrect results.
>+        void resume() { suspended = false; }

These comments overlap with the comments on 'suspended' above.  You can
strip them down to just discuss when/where they have to be inserted.

r=me with those nits addressed, no need to re-review.
Attachment #476431 - Flags: review- → review+
Comment on attachment 476431 [details] [diff] [review]
Simple patch to suspend/resume CSE around synthetic control-flow diamonds (v2)

Looks good to me
Attachment #476431 - Flags: feedback?(edwsmith) → feedback+
Capture pointer to CseFilter, if we are using one, in the CodegenLIR instance, and provide convenience methods to invoke CSE suspend/resume.
Attachment #478130 - Flags: review?(edwsmith)
Attachment #478130 - Attachment description: Tamaring helper methods: CodegenLIR::suspendCSE() and CodegenLIR::resumeCSE() → Tamarin helper methods: CodegenLIR::suspendCSE() and CodegenLIR::resumeCSE()
Attachment #478130 - Flags: review?(edwsmith) → review+
Nanojit import:

http://hg.mozilla.org/tamarin-redux/rev/645b96ef1a88

Helper methods:

http://hg.mozilla.org/tamarin-redux/rev/a57d966849a3
Status: NEW → RESOLVED
Closed: 9 years ago
Resolution: --- → FIXED
Whiteboard: PACMAN, fixed-in-nanojit, fixed-in-tracemonkey → PACMAN, fixed-in-nanojit, fixed-in-tracemonkey, fixed-in-tamarin
Assignee: nobody → wmaddox
Product: Core → Core Graveyard
You need to log in before you can comment on or make changes to this bug.