Closed Bug 575527 Opened 14 years ago Closed 12 years ago

implement a weak phi node

Categories

(Core Graveyard :: Nanojit, defect)

defect
Not set
normal

Tracking

(Not tracked)

RESOLVED WONTFIX

People

(Reporter: luke, Assigned: wmaddox)

References

Details

Attachments

(8 files, 3 obsolete files)

1.12 KB, text/plain
Details
3.31 KB, text/plain
Details
2.47 KB, text/plain
Details
54.96 KB, patch
Details | Diff | Splinter Review
13.55 KB, patch
edwsmith
: review+
Details | Diff | Splinter Review
24.12 KB, patch
Details | Diff | Splinter Review
5.39 KB, patch
Details | Diff | Splinter Review
1.90 KB, patch
wmaddox
: review+
Details | Diff | Splinter Review
I am interested in avoiding the following diamond pattern:

phi = lir->insAlloc(...);
branch = lir->insBranch(c);                  // if (c) {
x = ...                                      //   x
lir->insStore(x, phi, ...);                  //   phi = x
from_then = lir->insBranch(LIR_j);           // }
branch->setTarget(lir->ins0(LIR_label));     // else {
y = ...                                      //   y
lir->insStore(y, phi, ...);                  //   phi = y
from_then->setTarget(lir->ins0(LIR_label));  // }
z = lir->insLoad(..., phi);                  // z = phi

by building a highly restricted version of the classic phi node that only handles joining values from explicitly declared forward branches:

branch = lir->insBranch(c);                  // if (c) {
x = ...;                                     //   x
from_then = lir->insBranchToPhi(LIR_j);      // }
branch->setTarget(lir->ins0(LIR_label));     // else {
y = ...;                                     //   y
                                             // }
z = lir->insBranchWithPhi(from_then, x, y);  // z = phi(x, y)

One can imagine an insBranchWithPhiN that allows more incoming edges.

AFAICS, in this restricted form, register allocation wouldn't need significant changes: when the backwards scan reached the insBranchWithPhi, the allocation state would give z's register to y, continue up to insBranchToPhi, then give z's register to x.
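The backward scan described above can be sketched in a few lines of C++. This is a hypothetical simulation, not nanojit code: the struct and method names are invented, and register state is modeled as a simple name-to-register map.

```cpp
#include <cassert>
#include <map>
#include <string>

// Hypothetical sketch of the backward register-allocation scan described
// above. Instructions are visited in reverse: when the scan reaches the
// phi join (insBranchWithPhi), z's register is handed to y (the value on
// the fallthrough edge); when it later reaches insBranchToPhi, the same
// register is handed to x (the value on the taken edge).
struct WeakPhiScan {
    std::map<std::string, int> regs;  // value name -> register number

    // At the join: z currently holds register r; walking backwards, y is
    // the definition reached first, so y inherits z's register.
    void atBranchWithPhi(const std::string& z, const std::string& y) {
        int r = regs.at(z);
        regs.erase(z);   // z's live range begins here (in reverse order)
        regs[y] = r;
    }

    // At the declared forward branch: x inherits the same register on the
    // other incoming edge. (y's live range ends above this branch; its
    // release is handled by the normal scan, omitted here.)
    void atBranchToPhi(const std::string& y, const std::string& x) {
        int r = regs.at(y);
        regs[x] = r;
    }
};
```

The point of the sketch is that no copies or spills are introduced: the single register simply changes owner twice as the scan moves upward.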

Such a phi node would allow us to avoid js_UnboxDouble without the DCE/CSE problems associated with insAlloc'd phi nodes.  There may be a few other phi diamonds we could avoid as well.

I know Andreas was talking about another plan to add phi nodes.  Andreas: is that plan equivalent to or more powerful than what is described above?  If it is more powerful, then perhaps this weak phi node would be a good initial step in that it wouldn't be a huge change or make register allocation significantly more expensive (at least, if I haven't overlooked something important; what do you think Nick?).
I would say let's hack it up and see what it looks like. We have a more generic plan in mind, and an entire week to hack on it next week as it turns out :) But this seems simple and might provide a big benefit with a few lines of code added.
Shouldn't this bug be in Nanojit? Sorry if I mis-classified optimistically...

/be
Assignee: general → nobody
Component: JavaScript Engine → Nanojit
QA Contact: general → nanojit
yeah, njn for sure.
OS: Linux → All
Hardware: x86 → All
I spent a fair amount of time studying the code in intersect/unionRegisterState, thinking about phi node resolution, and I agree, the forward branch case should be straightforward.  It also would be smart of us to incorporate Gal's patches which improve the reconciliation logic.

In the general case we'll also have to handle stack-to-stack copies.  Say z has spilled downstream; x and y will both have to copy their values to z's stack location in addition to leaving their values in whatever register z was assigned (possibly none).  But it should be doable.

Ordinarily that's easy and is just a register->stack store (leaving x or y assigned to just a register); if there is code with more phis than registers (even with just two branches), then reconciliation can itself cause spilling.  Still, it doesn't seem any worse than the tricky corner cases we already have.
(In reply to comment #2)
> Shouldn't this bug be in Nanojit? Sorry if I mis-classified optimistically...

(In reply to comment #3)
> yeah, njn for sure.

njn != NJ :)
(In reply to comment #0)

> z = lir->insBranchWithPhi(from_then, x, y);  // z = phi(x, y)

Just curious, why "insBranchWithPhi()" vs "insPhi()"?
(In reply to comment #6)
> Just curious, why "insBranchWithPhi()" vs "insPhi()"?

Oops!  What I meant to write was "insLabelWithPhi" or "insLabelAndPhi" or even "insJoinPointWithPhi".  "insPhi" seems reasonable, I'm just used to seeing the phi node as a statement after the join point, not being the join point itself.
prototype work in progress at
http://hg.mozilla.org/users/edwsmith_adobe.com/nanophi

* extended LIR_label to point to a chain of phi nodes
* added LirWriter::insLabel() instead of ins0(LIR_label)
* added LIR_phi[iqd] instructions, which take argc/args[].  phi instructions are emitted into the LirWriter pipeline like other instructions (lirWriter->insPhi(...)) but are not emitted into the instruction buffer, instead they are allocated in the same arena and pointed to by the label and by users of the phi value.

known limitations:
* we will later want a way to extend the phi args incrementally as code is being generated (as merge conflicts are discovered)
* lirasm requires phi to take exactly 2 args.  (need more lirasm mojo)
* Assembler changes not implemented yet.  anticipating bug 514102
Blocks: 578917
Any news on this?
Hi, Robert.  I've been looking at this issue, and plan to pick this up and run with it.  I'm partial to the approach Edwin has taken in comment #8.  I see one difficulty, though, in that there is no provision to associate the phi arguments with the incoming flow edges to which they correspond.  The textbook solution would assign an order to the incoming flow edges, and associate them with the phi arguments positionally, but this is awkward without an explicit CFG.  I propose instead that the phi arguments be represented as an unordered set of LIns* pairs, where one member of the pair is the branch instruction from the preceding block, or NULL if we reached the label via fallthrough.  This representation should also make it easier to add phi arguments incrementally if we move to the general use of SSA beyond the synthetic diamonds considered here.

During assembly, we merge register state from the target label at forward branches. When we pick up the state of a register from a label (LabelState) currently holding the value of a phi node,  we replace the phi node in the register state with the phi argument belonging to the branch.  In this way, the register reservation for the phi node is given to the correct phi argument in each predecessor block as Luke proposes, but allowing for arbitrary forward branching rather than a simple diamond.
Make it so!  We can always change the LIR representation of phi inputs later on to be the textbook solution.
Assignee: nobody → wmaddox
Bill, any progress here?
Hi, Nick.  This has been a priority for me the last few days, and I've been on a "deep dive" into regalloc and SSA.  I've worked up a dirty but functional prototype of the idea in comment #10.  It turns out that the register reconciliation we have to do as a result of live-range splitting (inherent in our design) is exactly the sort of parallel assignment that is required during phi-elimination.  (This is much clearer with the patch for bug 514102, as the current implementation just spills its way around the hard part of permuting a set of registers.)  With all variables in registers, I can handle the phis on a forward branch with a simple modification to unionRegisterState():
The line
    LIns* savedins = saved.getActive(r);
becomes
    LIns* savedins = selectPhiArg(saved.getActive(r), fromBranch);
where selectPhiArg() returns the phi argument associated with the incoming edge if the active insn is a phi, otherwise it just returns the active insn.  Although I haven't tried it yet, I believe this same approach can be applied uniformly in the register reconciliation process to handle backward flow as well.
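A minimal, self-contained sketch of what selectPhiArg() might look like over the representation proposed in comment #10 (an unordered set of (predecessor, value) pairs, with NULL keying the fallthrough edge). The `Ins` struct here is invented for illustration; it is not nanojit's LIns.

```cpp
#include <cassert>
#include <utility>
#include <vector>

// Hypothetical stand-in for a LIR instruction. A phi carries its incoming
// values as (predecessor, value) pairs, where the predecessor key is the
// branch instruction entering the label, or nullptr for the fallthrough.
struct Ins {
    bool isPhi = false;
    std::vector<std::pair<const Ins*, const Ins*>> args;  // phi only
};

// If `active` is a phi, return the argument flowing in along `fromBranch`;
// otherwise the active instruction itself is already the right value.
const Ins* selectPhiArg(const Ins* active, const Ins* fromBranch) {
    if (!active->isPhi)
        return active;
    for (const auto& a : active->args)
        if (a.first == fromBranch)
            return a.second;
    return nullptr;  // no argument for this edge; would assert in practice
}
```

With this helper, the one-line change to unionRegisterState() quoted above substitutes the correct incoming value for the phi on each predecessor edge.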

The difficulty occurs with phi nodes that have spilled downstream. Because we do not allow for holes in the lifetime of spill slots, we do not presently need to reconcile them at control flow merges.  Proper assembly of phi nodes and phi arguments requires that we do so, however.  Conceptually, we reconcile the locations of the live results, but the present implementation optimizes this by considering only those results that are in registers.

For forward merges only, naive phi elimination works, so I added a separate reconciliation step to scan the phi nodes at a label and generate copies when either the argument or phi result was on the stack, deferring the register-register case to the existing reconciliation step, as modified above.  It's a bit ugly, though, and currently punts on non-integer phis and non-x86 processors.  Starting with Andreas' register-shuffling code, I think it will be straightforward to combine this hacky reconciliation step with the existing register state merge in a way that will handle the general case.  This would bring us tantalizingly close to full SSA support.

I've looked at several "halfway house" simplifications, hewing to the spirit of Luke's original proposal.  If we restrict phis to labels reached only by forward flow *and* require that a phi argument must be the sole use of its definition (a kind of "linear SSA", as in linear logic), then Luke's implementation sketch works as proposed: we simply give the allocation of the phi node to each of its arguments in turn, and no copies are needed.  The difficulty here is in maintaining the stack reservation for a spilled phi until all of the predecessors are done with it.  This can be done by allowing multiple instructions to share the same stack reservation, using a reference count to deallocate it when all users have released it.  While we currently maintain a back-link from the reservation to a single owner, we don't seem to be using it for anything, unlike the case for registers.  My current conjecture, however, is that the extra code would be better spent removing the single-use restriction by allowing for stack-stack and stack-register copies, and treating any stack allocation monkey-business as a possible future optimization.
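The reference-counted reservation idea above is simple enough to sketch directly. This is an illustrative class, not nanojit's actual reservation machinery: the spilled phi and each predecessor's argument would retain the shared slot, and the slot is deallocated only when the last user releases it.

```cpp
#include <cassert>

// Hypothetical sketch of a shared, reference-counted spill slot: several
// instructions (a spilled phi node and its arguments) point at one stack
// reservation, which stays live until every user has released it.
class StackReservation {
public:
    explicit StackReservation(int slot) : slot_(slot), refs_(0) {}
    int slot() const { return slot_; }
    void retain() { ++refs_; }
    // Returns true when the last user releases the reservation, i.e. the
    // slot may actually be deallocated by the caller.
    bool release() {
        assert(refs_ > 0);
        return --refs_ == 0;
    }
private:
    int slot_;   // stack offset of the shared slot
    int refs_;   // number of instructions still using the slot
};
```

The back-link mentioned above would become unnecessary under this scheme, since no single instruction owns the reservation.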

I note that the simple use case for control diamonds allows for the trivial and efficient removal of the phi by computing each argument into a common location, effectively coalescing the moves that would be inserted by naive phi elimination.  This is the essence of Luke's simple implementation strategy. Since we are not doing any optimizations that benefit from full SSA (though our current "almost SSA" is leveraged in several useful ways), it is worth considering introducing multiply-assigned variables that are treated analogously to alloc/st/ld, but which are preferentially allocated to registers and spilled only when necessary, say var/def/use.  We would like for the var instruction to hold a register over its lifetime, into which multiple expressions may compute their results.  Unfortunately, there is no way to provide a target register into which an expression result is to be placed without giving the register to the expression as its allocation, so extra registers and copies would be needed.  In this sense, the single-assignment constraint is deeply ingrained in our register allocator.  After looking at a few infeasible variants of this idea, I reluctantly conclude that phi nodes are the natural way to introduce simple assignable temporaries within our register allocation framework, even though that buys into SSA much more completely than our optimizer can exploit at present.

I am actively working on this, and expect to continue through the weekend.  I expect to get a patch or two up for discussion in a few days.
Good to hear you're deep into this.  However, what you wrote sounds a lot more complicated than I expected.  For TM at least we just want small control flow diamonds, where each path defines a single value.

> Since we are not doing any optimizations that benefit from full SSA (though our
> current "almost SSA" is leveraged in several useful ways), it is worth
> considering introducing multiply-assigned variables that are treated
> analogously to alloc/st/ld, but which are preferentially allocated to registers
> and spilled only when necessary, say var/def/use.  We would like for the var
> instruction to hold a register over its lifetime, into which multiple
> exptressions may compute their results.

You lost me here.  "Multiple-assigned variables" doesn't sound like SSA to me.

If there is a large design space here, perhaps it is best to start with something simple?
(In reply to comment #14)
> Good to hear you're deep into this.  However, what you wrote sounds a lot more
> complicated than I expected.  For TM at least we just want small control flow
> diamonds, where each path defines a single value.

It would be possible to tighten up the representation of phi nodes slightly by requiring that a label have at most one phi node with exactly two arguments.  Edwin had already written the code for the more general case, however, and I found it convenient to work with at present.  The sticky issues are not helped by this, though -- as far as I can tell, it is no more difficult to handle the general case of forward merges than a single diamond.  Any significant simplification then is a matter of cutting corners, legislating away awkward cases by adding restrictions.  Considering only forward merges is itself one.  Requiring that phi args be used only once is another.  I come away with the suspicion, however, that these restrictions add as many problems as they solve, particularly with assertions all over the place to enforce them.  So I'd like to explore a bit larger part of the design space, and thereby get better data about the tradeoffs involved in trimming some of the complexity away.  My intuition is that the general case is only marginally more difficult.  Note that I'm not talking about converting user code into SSA, just handling phi nodes correctly once they are inserted, which is very similar to what we already do with register reconciliation.

> You lost me here.  "Multiple-assigned variables" doesn't sound like SSA to me.

Exactly.  We are only SSA now if you ignore the use of scalar variables on the stack, which are essential to getting any work done w/o phi nodes.  We could have, in principle, explicitly-marked register-resident temporaries that do not obey the SSA restriction, but instead are treated, e.g. for optimization purposes, like the memory-resident temporaries we already have.  This is appealingly simple and intuitive as a lightweight solution,  and would be trivial to implement in some contexts, but apparently not in ours.  This is an abandoned blind alley I had started to explore some time ago, but thought it was worth mentioning.
This patch implements phi nodes for targets of forward branches.  It builds on Edwin's earlier work (see comment #8).  Originally, it was built on top of Andreas's improved state merging patch (see bug 514102), but it turned out to be largely separable from that, so only a few snippets remain, in the interest of minimizing the size of the present patch.

The major simplification is that phi node results are required to begin their lifetimes in a register, which allows them to be handled in a manner more closely aligned with ordinary instructions.  This obviates the need for stack-to-stack copies and/or reassignment of stack slot allocations, and avoids the need for a separate reconciliation process similar to register state merging for phi nodes that are not in registers.

The emphasis here is on the translation aspects, and both the API for creating phi nodes and the lirasm syntax could use a bit more work.  The code currently takes the expedient of treating the argument list of a phi node as an alternating list of (key, phi argument) pairs, where the key is the branch from the predecessor block, or the block label for a fallthrough edge.  This is a bit clumsy, and it would be cleaner to attach a single list of predecessors to the label and index the phi arguments positionally.
Attachment #464614 - Flags: feedback?(edwsmith)
This patch implements a version of the inlined atomToDouble conversion described in Bug 562744, using phi nodes.  This is a simple case of a nested control diamond, using two tag checks to select between three different means of computing the result.  It has been built and tested on MacOS X / x86 for 32-bit (sse2 and x87) and 64-bit (sse2) configurations, and passes all acceptance tests.
Attachment #464617 - Flags: feedback?(edwsmith)
(In reply to comment #16)
> Created attachment 464614 [details] [diff] [review]
> Implement phi nodes on targets of forward branches

Observations:

1) This scheme generates the tightest code for the simplest use case, in which each phi argument is used exactly once in that role.  The phi arguments are simply evaluated into the phi-node result register, coalescing the copies dictated by naive phi-node elimination.  The situation is more complicated if a phi argument has further uses downstream, as it must not be overwritten by phi arguments corresponding to other incoming edges.  This can be done most cleanly by performing a register-register copy from the phi argument's register to the phi node's register, leaving the original phi argument intact.  Unfortunately, if this copy needs to be performed on the taken edge of a branch, there is nowhere to put it, as we don't have the infrastructure for splitting arbitrary edges.  We finesse the issue by spilling all phi arguments passed on a taken branch that are subsequently referenced (i.e., not dead following the phi node).  The necessary copy is implemented by the spill and restore.  Note that at least one spill will be performed, and if execution reaches the restore, we will have paid the same price as if the merge had been implemented without phi.  There may still be a win in avoiding back-to-back stores and loads, but it may turn out to be a wash.  Indeed, requiring a register for the phi node may force an additional eviction, leading to a net loss.  It is therefore advisable to arrange the code so that phi arguments whose lifetime extends beyond the phi are transmitted on the fallthrough edge if possible.

2) This scheme is compatible with Andreas' recent register merging patch, which replaces spill/restore during register state merging with register to register copies when possible.  A slightly earlier version of the present patch was integrated with Andreas' patch, and appeared to work correctly.

3) The register copy on line 1924 of nanojit/Assembler.cpp does not take into account cyclic dependencies among the phi nodes and arguments.  This situation should not occur naturally in the absence of loops, but can be created artificially and is not detected.  If desired, the planMove and shuffleRegs routines from Andreas' patch could be applied here to handle cycles correctly.
Attachment #464614 - Flags: feedback?(nnethercote)
Attachment #464614 - Flags: feedback?(lw)
Attachment #464614 - Flags: feedback?(gal)
One question whose answer I am not experienced enough to discern from the patch: will uses of phi nodes be as CSEable and DCEable as, say, a LIR_cmov?
Using a control-flow diamond and a phi node will not be as efficient as a LIR_cmov if you are simply selecting between two values that have already been computed or are so cheap to compute as to offset the (not inconsequential) cost of a branch.  Introducing branches and labels will impede CSE.  On the other hand, if the objective is to avoid an expensive procedure call on one branch of the diamond, as in my Tamarin use case, LIR_cmov would be useless.
Does "Introducing branches and labels will impede CSE" mean a phi node cannot be CSE'd ever?  What about DCE?  Is there any way that the restrictive form of these phi nodes will allow CSE/DCE to work around the branches and labels?
The isResultLive flag should be correctly propagated by phi nodes, i.e., the arguments of a phi node will be marked live only if the phi node itself is live.
Phi nodes are attached to the label they belong to, and are not actually included in the LIR instruction stream, so CSE won't see them.  Since we clear the CSE table at each label (thus CSE only within a basic block), the only opportunity for CSE of phi nodes would be amongst phi nodes belonging to the same label.

I expect that that these phi nodes are neutral with respect to the optimizations that we currently perform.  Note that aggressive optimization involving phi nodes might break the structural constraint, requiring a fully-general implementation of phi nodes (i.e., at targets of back-edges) for correctness.
The reason I ask is that it would be nice to be able to almost always prefer phi nodes to cmov instructions (http://ondioline.org/mail/cmov-a-bad-idea-on-out-of-order-cpus).  However, if part of the decision is "will I lose CSE opportunities by using the phi node?", then I would probably restrict my use of phis to the case you described: where one branch is a function call.  Either way, any phis are better than no phis :)
(In reply to comment #23)
> (http://ondioline.org/mail/cmov-a-bad-idea-on-out-of-order-cpus). 

Is this still true on post-P4  x86 systems? I have a vague memory of Moh Haghighat of Intel (now cc'ed on this bug) indicating that cmov was indeed worth using on Core systems.
Recent Intel optimization manuals also advise against it unless you are really sure.
(In reply to comment #25)
> (In reply to comment #23)
> > (http://ondioline.org/mail/cmov-a-bad-idea-on-out-of-order-cpus). 
> 
> Is this still true on post-P4  x86 systems? I have a vague memory of Moh
> Haghighat of Intel (now cc'ed on this bug) indicating that cmov was indeed
> worth using on Core systems.

Could always try the benchmark that was included in that thread.

(At this point I have not gone through the effort of building it, but at least I decoded it from the base64 attachment in the thread.  I'll post it as an attachment in a moment.)
Aforementioned benchmark, snarfed from the above-referenced linux-kernel email thread.
(In reply to comment #23)
> The reason I ask is that it would be nice to be able to almost always prefer
> phi nodes to cmov instructions
> (http://ondioline.org/mail/cmov-a-bad-idea-on-out-of-order-cpus).  However, if
> part of the decision is "will I lose CSE opportunities by using the phi node?",
> then I would probably restrict my use of phis to the case you described: where
> one branch is a function call.  Either way, any phis are better than no phis :)

The main thing I'd be concerned about for now is that using a phi requires a label, and a label has cost.  (resets CSE).  If you already want to compute two values, and pick one, then a LIR_cmov instruction seems like the right way to express it, regardless of machine issues.

If a conditional branchover is preferred to an actual machine cmov, then the backend for that CPU should emit the code that way.  (In fact, we already emit code that way on PPC, which doesn't have a CMOV instruction).

If a frontend doesn't want to take the chance and knows that the condition will be highly biased one way or the other, then explicitly branching and using a phi, seems like the right choice.
looking back, making LIR_cmov "optional" seems like a bad choice, given the pain it has caused by rippling cpu-detection up through the layers of the JIT.  Even on x86, a conditional branch plus a mov, is probably only 1 byte longer than a cmov, on older cpu's, and the alternatives (explicit branching, or bitbanging) are much more expensive when expressed in LIR.
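The "let the backend decide" idea in the two comments above can be illustrated with a toy lowering function. Everything here is invented for illustration (the mnemonics and the function are not nanojit code): the frontend always emits LIR_cmov, and each backend lowers it either to a machine cmov or to a conditional branch over a mov, as PPC already must.

```cpp
#include <cassert>
#include <string>
#include <vector>

// Hypothetical sketch: lower a LIR-level conditional move either to a
// machine cmov (when the target CPU has one) or to a branch-over-mov
// sequence (as on PPC, which has no cmov instruction). The mnemonics are
// illustrative pseudo-assembly, not any real backend's output.
std::vector<std::string> lowerCmov(bool haveCmov) {
    if (haveCmov)
        return {"cmp  cond", "cmovne dst, src"};
    return {"cmp  cond", "jne  skip", "mov  dst, src", "skip:"};
}
```

Either way the LIR stays target-independent, which avoids rippling CPU detection up through the layers of the JIT.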
(In reply to comment #25)
> Is this still true on post-P4  x86 systems?

Linus' remarks in the email exchange above, including a microbenchmarking experiment, indicate this is the case.  The reasoning seems sound for predictable branches and aggressively out-of-order CPUs.  Don't forget the embedded platforms, however -- speculation costs power, so in-order designs remain common in the embedded world.

We might consider generating LIR_cmov where the semantics are a good fit, and lower it to a branchover following CSE on platforms where it makes sense, cf. the Specializer in Tamarin.
(In reply to comment #30)
> looking back, making LIR_cmov "optional" seems like a bad choice, given the
> pain it has caused by rippling cpu-detection up through the layers of the JIT. 
> Even on x86, a conditional branch plus a mov, is probably only 1 byte longer
> than a cmov, on older cpu's, and the alternatives (explicit branching, or
> bitbanging) are much more expensive when expressed in LIR.

Agreed. IMHO LIR_cmov should be expressible everywhere as a conditional move, with processor-specific backends deciding the right implementation (so perhaps x86 won't actually use cmov). 

Might also make sense to augment it a bit to express whether one value is more likely than another, so that a test-and-branch implementation can order it to maximize branch prediction.
I did a bit of benchmarking of my own. Linus' remarks from back when are no longer accurate on my Core i7 laptop.

For the floating point cmovd I am already emulating the cmov with a branch on most platforms. We should do the same with cmovi/cmovp and implement them everywhere.
(In reply to comment #33)
>
> For the floating point cmovd I am already emulating the cmov with a branch on
> most platforms. We should do the same with cmovi/cmovp and implement them
> everywhere.

Yes.  I filed bug 586472 for this.

> The main thing I'd be concerned about for now is that using a phi requires a
> label, and a label has cost.  (resets CSE).  If you already want to compute two
> values, and pick one, then a LIR_cmov instruction seems like the right way to
> express it, regardless of machine issues.

Fortunately I have a cunning plan:  bug 542905.  There's even a patch;  when I wrote it, it didn't help TM but that might have changed by now, and if we start using phis heavily it will almost certainly change.

> If a frontend doesn't want to take the chance and knows that the condition will
> be highly biased one way or the other, then explicitly branching and using a
> phi, seems like the right choice.

The classic case is when the unlikely case involves a C call.  Then the short-circuiting (a.k.a. lazy) evaluation is critical.
Bill, Ed and I had a long discussion on IRC about the patch and concluded that the API for creating phis could use some tweaking.  (Maybe the lirasm syntax could be tweaked similarly?)  Bill will make those changes and post an updated patch.
Comment on attachment 464617 [details] [diff] [review]
Tamarin patch incorporating inlined atom->double conversions using phi support

this is just feedback, and the patch is basically okay.  

the change from ins0(LIR_label) to insLabel() should be separated from patches that start using phis, of course.

The code in #ifdef VMCFG_FASTPATH_FROMATOM would be easier to follow if it were factored into a subroutine, and prefixed with some pseudocode.  A blank line inbetween each chunk of C++ that generates a basic block also would help.

As we start dropping more and more inline sequences into the code, we have to be on the lookout for code structure messes; in particular, we need to leave droppings that say "we're emitting inline code for such-and-such C++ function, please keep them in sync"  (even if the C++ function would be dead code, it should be written out somewhere).
Attachment #464617 - Flags: feedback?(edwsmith) → feedback+
Attachment #464614 - Flags: feedback?(edwsmith)
The correspondence that Steven mentions in comment #25 was through the Tamarin-devel mailing list back in April 2008. I've attached it. 

Basically, if a branch is predictable, the CMP/JMP is better than CMOV; otherwise, CMOV is better, and in the out-of-order execution engines, the performance advantage of CMOV can be substantial. However, one has to avoid aggressive use of CMOVs as speculative work increases.

As mentioned in the attachment, there is a pattern that is not particularly suitable for CMOV. If you are computing the min or max of a set of elements in a short loop, you basically end up having something like x <- op (x, y), i.e., the destination is the same as one of the operands. By using CMP/JMP, you can avoid the implied x <- x move. That is, use of CMOV introduces this undesired data dependence that could have been avoided. However, if you have x <- op (y, z), use of CMOV is fine. Generally, one has to avoid CMOV latencies that can pile up in loop-carried data dependencies.

Here's a re-run of the test benchmark I had created back then on some older systems as well as some new ones.


          The performance impact of using CMOV instead of CMP/JMP on IA

                 Pentium-4     Core-2    Core-i7    Atom
  Predictable      -30%          +1%       -10%      -1%
Unpredictable      +14%         +19%       +30%      +5%

The IA optimization guide covers this topic in Section 3.4.1.1: "Eliminating Branches" (page 110): http://www.intel.com/assets/pdf/manual/248966.pdf

So, the question boils down to our knowledge about the predictability of branches and the frequency of the unpredictable ones. A selective hybrid approach might be the best choice. 

-moh
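The min-reduction pattern described above is easy to see in plain C++. Both functions below compute the same result; the point is the shape a compiler or JIT sees: the branchy form only writes x when a new minimum is found, while the cmov-style form performs an unconditional x <- op(x, y) write, creating a loop-carried data dependence through x on every iteration.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Branchy reduction: with predictable data, x is untouched on the common
// path, so there is no per-iteration dependence through x.
int minBranchy(const std::vector<int>& v) {
    int x = v[0];
    for (std::size_t i = 1; i < v.size(); ++i)
        if (v[i] < x)
            x = v[i];
    return x;
}

// Cmov-style reduction: the ternary maps naturally onto CMOV, but the
// unconditional write x = op(x, y) serializes iterations on x's latency.
int minCmovStyle(const std::vector<int>& v) {
    int x = v[0];
    for (std::size_t i = 1; i < v.size(); ++i)
        x = (v[i] < x) ? v[i] : x;
    return x;
}
```

Which form wins depends on the data, per the table above: predictable inputs favor the branchy form, unpredictable ones the CMOV form.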
The test mentioned in comment #37
This patch reflects some recent IRC discussion with Nick and Edwin.
There are two major changes:

1) The nanojit-level API conforms to the usual scheme in which all of the children of an instruction are presented at the time that the instruction is created. Previously, insPhi() took a label argument, and chained each phi to the label as a side-effect.

2) The helper functions used in CodegenLIR abstract away the phi nodes, and present a cleaner model in which labels are viewed as continuation functions that take arguments.  Here is the inlined atom2double:

            CodegenLabel not_intptr;
            CodegenLabel not_double;
            CodegenLabel done(LTy_D);
            LIns* val = loadAtomRep(index);
            LIns* tag = andp(val, AtomConstants::kAtomTypeMask);
            branchToLabel(LIR_jf, eqp(tag, AtomConstants::kIntptrType), not_intptr);

            LIns* v1 = p2dIns(rshp(val, AtomConstants::kAtomTypeSize));
            jumpToLabelWithValue(done, v1);
            emitLabel(not_intptr);
            branchToLabel(LIR_jf, eqp(tag, AtomConstants::kDoubleType), not_double);
            LIns* v2 = ldd(subp(val, AtomConstants::kDoubleType), 0, ACCSET_OTHER);
            jumpToLabelWithValue(done, v2);
            emitLabel(not_double);
            LIns* v3 = callIns(FUNCTIONID(number), 1, val);
            fallToLabelWithValue(done, v3);
            return emitLabelWithMerge(done);

I've only implemented the slightly sugared form for the common case where a single merged result value is needed.  The scheme generalizes to multiple merged results at the cost of passing a length and an array pointer in place of the single values, and additional function calls, e.g., label.getResult(n), to retrieve the phi nodes at the merge label.

Issues/etc.:

1) Naming of the nanojit API functions could use more thought, and existing conventions are inconsistent and could be rationalized.

2) Positional indexing of phi arguments is useful.  This is not so clear for the predecessors of a label and its collection of phi nodes.  This is the established precedent, however, for variable-arity LIR opcodes.

3) Generalizing to N merged values at a label will make CodegenLabel considerably heavier.

4) In CodegenLIR::emitLabelWithMerge(), two vectors of LIns* (one for values and one for branches) are created, only to be copied into identical vectors in nanojit's arena, and then discarded.
Attachment #466862 - Flags: feedback?(edwsmith)
Attachment #466862 - Flags: feedback?(nnethercote)
Attachment #466862 - Attachment is patch: true
Attachment #466862 - Flags: feedback?(lw)
Comment on attachment 466862 [details] [diff] [review]
Tamarin patch incorporating inlined atom->double conversions using phi support (revised API)

>+                case LIR_q2d:
>+                    countlir_fpu();
>+                    ins->oprnd1()->setResultLive();
>+                    if (ins->isExtant()) {
>+                        asm_q2d(ins);
>+                    }
>+                    break;

This should probably go in a separate bug.


>+        // Unlike other LIR operators, phi nodes are attached to a label
>+        // and are not in the code stream.

Nit: "LIR instructions" is better than "LIR operators".


>diff --git a/nanojit/LIRopcode.tbl b/nanojit/LIRopcode.tbl
>--- a/nanojit/LIRopcode.tbl
>+++ b/nanojit/LIRopcode.tbl
>@@ -127,19 +127,19 @@ OP___(retd,      8, Op1,  V,    0)  // r
> 
> OP___(livei,     9, Op1,  V,    0)  // extend live range of an int
> OP_64(liveq,    10, Op1,  V,    0)  // extend live range of a quad
> OP___(lived,    11, Op1,  V,    0)  // extend live range of a double
> 
> OP___(file,     12, Op1,  V,    0)  // source filename for debug symbols
> OP___(line,     13, Op1,  V,    0)  // source line number for debug symbols
> 
>-OP_UN(14)
>-OP_UN(15)
>-OP_UN(16)
>+OP___(phii,     14, Phi,  I,    0)  // phi node for int
>+OP_64(phiq,     15, Phi,  Q,    0)  // phi node for quad
>+OP___(phid,     16, Phi,  D,    0)  // phi node for double

LIRopcode.tbl is the primary documentation source for LIR opcodes.  Can you
please give a more detailed description about phis and labels and how they
work together?  Thanks.


There's a function called staticSanityCheck() in LIR.cpp.  You need to add
additional static assertions for each of LInsLbl and LInsPhi.  Just copy the
shape of the existing ones.


ValidateWriter needs to check labels and phis.


insLabel() now has four arguments.  If I'm adding just a vanilla label
without any phis, what happens -- do I pass 0 and NULL args?  If so, a
zero-arg helper in LirWriter would be useful.


I wanted to apply this to TM and see what needed changing, but the patch has
bitrotted somewhat.  Can you please update it?  Thanks!


I'll give a f+ because the direction is looking good.
Attachment #466862 - Flags: feedback?(nnethercote) → feedback+
Revised phi nodes patch, incorporating feedback from Nick and Rick.
Implements phi node display in verbose output, which was previously
omitted in some contexts.

This patch includes only the patches to the 'nanojit' directory,
so it may be applied either to njc or tr.
Attachment #464614 - Attachment is obsolete: true
Attachment #468462 - Flags: review?(edwsmith)
Attachment #468462 - Flags: feedback?(rreitmai)
Attachment #464614 - Flags: feedback?(nnethercote)
Attachment #464614 - Flags: feedback?(lw)
Attachment #464614 - Flags: feedback?(gal)
This patch adds lirasm support for phi nodes to njc.  The syntax is a rather direct rendering of the procedural API, and does not attempt to create a sugared user-friendly assembly language.  Includes some phi node tests for the
testlirc.sh script.
Attachment #468463 - Flags: review?(edwsmith)
Attachment #468463 - Flags: feedback?(rreitmai)
Attachment #468462 - Flags: review?(nnethercote)
This is an example of the use of phi nodes in Tamarin.  It combines a proposed set of helper functions for working with phi nodes with another patch to inline atom-to-double conversions (see bug 562744).  These should likely land as separate patches; however, this combined patch is useful for testing and evaluating the nanojit phi node patches.  It requires the nanojit patch https://bug575527.bugzilla.mozilla.org/attachment.cgi?id=468462 for phi nodes, as well as another to implement the LIR_q2d instruction.
Attachment #464617 - Attachment is obsolete: true
Attachment #466862 - Attachment is obsolete: true
Attachment #466862 - Flags: feedback?(lw)
Attachment #466862 - Flags: feedback?(edwsmith)
Early experiments with a larger set of inlining patches on IA32 showed disappointing results using phi nodes compared against using temporary stack slots.  This may be an artifact of x86, which goes to great lengths to optimize such memory references in order to compensate for its small register set.  Experiments on other platforms are in order.

A weakness of the current design is that the forced allocation of a register to hold a phi node result may cause a spill, in which case the generated code will
be at least as bad as had a stack temporary been used in the first place.  It only makes sense to use a phi node when a register is free to hold it.
Comment on attachment 468463 [details] [diff] [review]
lirasm support for phi nodes

R+ because I can't find any bugs, and since this depends on landing the PHI support anyway.

Now that I've been away from this for a while, and coming back, I have to struggle to parse the phi + label clichés in the *.in test cases.  If verbose output looks like this, it will be pretty tough on the brain.  And, well, verbose output and lirasm should be as similar as possible.

* l0 = ... is hard to read, L0 would be easier (yes, nit)

* having "foo:" imply a LIR_label instruction with the name 'foo' seems nice until you have to explicitly provide arguments to the label instruction; then switching to "foo= label ..." seems weird.  Out of scope for this bug, but a better syntax might be:  "foo: label ..." where the "label" instruction keyword is always required (since it must be in LIR) and the "foo:" prefix is just an assembler name for the label (always required).  In retrospect, it's a bug that lirasm will fill in the LIR_label instruction for you... assemblers just shouldn't do that.  In traditional assemblers, "foo:" creates a symbol 'foo' whose value is the address of whatever comes next (but nothing is generated next).

* overloading * to mean fallthrough is giving me heartburn just because my brain wants * to mean something wildcard-y.

* to the naked eye, the fallthrough path before each label will execute the phi. I know it won't, and I know the phis are just "attached" to the label, but I still trip up when reading because I'm so used to seeing the phis after the label in SSA literature.  This would be clearer (and admittedly harder to implement in lirasm) (stolen from phi2.in):

   L0 = label *, jo: {
      a2 = phii a1 b1 
      b2 = phii b1 a1
   }
Attachment #468463 - Flags: review?(edwsmith) → review+
Comment on attachment 468462 [details] [diff] [review]
Implement phi nodes on targets of forward branches (v2)

Cancelling review to clear my queue.  This bug needs a sponsor to move forward, assuming the benefits in TR are neutral.

Bill, would it be possible to run some experiments on brightspot that count the static # of loads and stores, using LIR_phi vs explicit LIR_alloc/LIR_st/LIR_ld? 

That might show that while LIR_phi could shift spills around, it might still be a net win by removing more stores & loads than it introduces.

The first TR use case is inlining, atomToDouble(), and it looked like the first TM use case was js_unboxDouble().  If a simple phi doesn't help in either case, then this probably becomes WONTFIX, or maybe (later) we expand the scope of the experiment in TR to attempt full phi support, then use it for user-level variables in addition to inlined helpers. (bigger, longer, project).
Attachment #468462 - Flags: review?(edwsmith) → feedback?(lw)
Here's a tiny patch needed to make TraceMonkey work with this change -- it just replaces ins0(LIR_label) with insLabel().

Question:  I see this kind of thing in the verbose output:

      jt ltui1 -> label3 [jt1]

The "[jt1]" bit is new -- is the idea that you want to number all the jumps?  If so, it would be nice to follow the precedent set by guards:

      xt2: xt eqi3 -> pc=...... imacpc=(nil) sp+24 rp+0 (GuardID=006)

That may be TM-specific, but it still seems like a reasonable precedent :)

I'm about to try adding some phi nodes to TM, I'll undoubtedly have more comments soon.
Attachment #479648 - Flags: review?(wmaddox)
It's been long enough since our earlier discussions that I've pretty much
forgotten how phis are encoded in LIR.  So I was hoping that looking at the
following things would give me enough info to work out how to do phi stuff:

- The description of phis in LIRopcode.tbl
- The signatures for LirWriter::ins{Label,Phi}()

Alas, no.  Is a control-flow predecessor a branch?  What are the args in an
insPhi() call?  I was hoping to see a simple example in LIRopcode.tbl, that
would be more helpful than the wordy comment.

In other words... here is how I build a simple one-value diamond using fake
phi nodes:

    phi = lir->insAlloc(...);
    br1 = lir->insBranch(LIR_jf, c);
    x = ...
    lir->insStore(x, phi, ...);
    br2 = lir->insBranch(LIR_j);
    br1->setTarget(lir->insLabel());
    y = ...
    lir->insStore(y, phi, ...);
    br2->setTarget(lir->insLabel());
    z = lir->insLoad(..., phi);

How do I do this with real phi nodes?  I'm really having trouble working
this out.  I looked at TR but the extra layer on top of the basic LIR
interface confused me.
Attachment #479648 - Flags: review?(wmaddox) → review+
(In reply to comment #48)
> Created attachment 479648 [details] [diff] [review]
> TM patch (against TM 54638:da5a3ef843f8)

> Question:  I see this kind of thing in the verbose output:
> 
>       jt ltui1 -> label3 [jt1]
> 
> The "[jt1]" bit is new -- is the idea that you want to number all the jumps?

Yes.
 
> If so, it would be nice to follow the precedent set by guards:
> 
>       xt2: xt eqi3 -> pc=...... imacpc=(nil) sp+24 rp+0 (GuardID=006)

I suppose that would work, but one must be careful not to interpret the
'xt2:' prefix as if it were a lirasm-style label, which would implicitly
insert a LIR_label instruction.  It is simply a name for the instruction,
as would be the case if one wrote 'xt2 = xt ...'  I actually used that
notation in lirasm, because I was attempting to leverage the existing machinery
as much as possible, and did not want to invent a bunch of new syntax or
rework the fundamentals of the assembler, though the resulting appearance is odd and a bit ugly.  Verbose output is easier, so I went with something that was pleasant to look at.
(In reply to comment #50)
> > If so, it would be nice to follow the precedent set by guards:
> > 
> >       xt2: xt eqi3 -> pc=...... imacpc=(nil) sp+24 rp+0 (GuardID=006)
> 
> I suppose that would work, but one must be careful not to interpret the
> 'xt2:' prefix as if it were a lirasm-style label, which would implicitly
> insert a LIR_label instruction.  It is simply a name for the instruction,
> as would be the case if one wrote 'xt2 = xt ...'

Well... in the LIR dumps, labels always appear on their own line so the possibility for confusion is low.  And the 'xt2 = xt ...' syntax makes it look like the guard returns a value that can be used later on, which is not so.
(In reply to comment #49)

> In other words... here is how I build a simple one-value diamond using fake
> phi nodes:
> 
>     phi = lir->insAlloc(...);
>     br1 = lir->insBranch(LIR_jf, c);
>     x = ...
>     lir->insStore(x, phi, ...);
>     br2 = lir->insBranch(LIR_j);
>     br1->setTarget(lir->insLabel());
>     y = ...
>     lir->insStore(y, phi, ...);
>     br2->setTarget(lir->insLabel());
>     z = lir->insLoad(..., phi);
> 
> How do I do this with real phi nodes?  I'm really having trouble working
> this out.  I looked at TR but the extra layer on top of the basic LIR
> interface confused me.

     br1 = lir->insBranch(LIR_jf, c);
     x = ...
     br2 = lir->insBranch(LIR_j);
     br1->setTarget(lir->insLabel());
     y = ...
     phi = lir->insPhi(LIR_phii, 2, x, y);  // for int result
     LIns* predNodes[] = { br2, NULL };     // predecessors: br2, fallthrough
     LIns* phiNodes[] = { phi };
     br2->setTarget(lir->insLabel(2, predNodes, 1, phiNodes));
     z = phi;
Phi nodes are always attached to a label.  The arguments of the phi nodes correspond positionally to the flow predecessors of the label.  To establish this correspondence, a label with one or more attached phi nodes also carries a list of its flow predecessors; each entry in the list is either a branch instruction, or NULL for the fallthrough case.
Attachment #468462 - Flags: review?(nnethercote)
Attachment #468462 - Flags: feedback?(rreitmai)
Attachment #468462 - Flags: feedback?(lw)
Attachment #468463 - Flags: feedback?(rreitmai)
(In reply to William Maddox from comment #45)
> Early experiments with a larger set of inlining patches on IA32 showed
> disappointing results using phi nodes compared against using temporary stack
> slots.  This may be an artifact of x86, which goes to great lengths to
> optimize such memory references in order to compensate for its small
> register set.  Experiments on other platforms are in order.

Bill: how much potential value from following through on this?  Do you think the patches posted here are worth resurrecting, or should we focus on our bigger-fish-to-fry?
The experiment was interesting, but not profitable.  It was hindered by our register allocator, which forced the phi nodes to be unnecessarily spilled to memory more often than not.  Once a phi node has spilled, the code is pretty much the same as what we get at present, simulating the phi nodes with memory load/stores to an explicit temporary. Phi nodes can only help us when they get allocated to registers.
If we were to make a significant investment in Nanojit these days, I'd write a better register allocator.  Only then might it make sense to revisit phi nodes.
Note also that the experiment used phi nodes only for localized control flow introduced by speculative inlining.  We'd need an into-SSA conversion pass to deal with control flow edges induced by branches occurring at the ABC level.
wontfix: see comment 55 and comment 56
Status: NEW → RESOLVED
Closed: 12 years ago
Resolution: --- → WONTFIX
Product: Core → Core Graveyard