Closed Bug 539874 Opened 10 years ago Closed 10 years ago

nanojit: remove LIR_ov

Categories

(Core Graveyard :: Nanojit, defect)

x86
macOS
defect
Not set

Tracking

(Not tracked)

RESOLVED FIXED

People

(Reporter: njn, Assigned: njn)

References

Details

(Whiteboard: fixed-in-nanojit, fixed-in-tracemonkey, fixed-in-tamarin)

Attachments

(3 files, 4 obsolete files)

Attached patch patch, i386 only (obsolete) — Splinter Review
Bug 495569 made a number of suggestions for cleaning up the semantics of LIR.  One of these was to replace LIR_ov with something cleaner.  Julian wrote a partial patch to do this.

Having been annoyed by LIR_ov several times recently, I've taken his patch and reworked it more thoroughly.  It now works on i386 and passes all the TM trace-tests.  It doesn't work on the other back-ends, and lirasm hasn't yet been updated.

I stuck with Julian's approach of replacing add/ov/xt, sub/ov/xt and mul/ov/xt sequences with single instructions addxov, subxov and mulxov.  There were several possibilities discussed in bug 495569 for removing LIR_ov but IMO this is the only clean one because all the others involved implicit dependencies between instructions just as LIR_ov does.

Pros of this approach:
- LIR_ov is gone, along with its implicit dependencies and all the headaches
  they cause.

- On i386, for LIR_add and some LIR_mul cases we can use LEA instead of ADD/MUL
  in the generated code because we know the condition codes no longer need to
  be set.  (The patch currently does this only in the easy case where one of
  the operands is constant.)  This has the potential to reduce register pressure
  (because LEA is effectively a 3-operand instruction) and also might allow
  multiple instructions to be combined into one.  (Nb: 'mul x, 8' is very
  common in trace-tests.)

- Overflowable add/sub/mul operations are no longer kept alive if the result is
  never used.  Previously they were because xt/xf guards are statements.
  This was touted in bug 495569 as a potentially large benefit of removing
  LIR_ov.  Unfortunately, it seems that TM never generates code in which the
  overflowable add/sub/mul result is dead -- at least, it never occurs in
  trace-tests.  So, not as useful as everyone thought.

Cons:
- The patch increases code size by 162 lines.  Much of this is due to the new
  insGuardXov() function in the LIR writer pipeline.

- Two extra opcodes are used.  (Actually, I think LIR_add and LIR_addp can 
  now be merged, in which case it's only one extra opcode.)  But it's hardly
  the opcode explosion that was mentioned as a possibility several times in 
  bug 495569.

SunSpider gives unchanged results, AFAICT through all the variation.  This isn't surprising as the patch hardly changes the generated code.
Attachment #421772 - Flags: review?(edwsmith)
This proposal seemed to be more controversial the first time around...
(In reply to comment #1)
> This proposal seemed to be more controversial the first time around...

Where? I missed the fight, darnit. FWIW I like the proposal and patch.

/be
(In reply to comment #2)
> > This proposal seemed to be more controversial the first time around...
> 
> Where? I missed the fight, darnit.

Bug 495569... you didn't miss it, you have two comments on that bug!
(In reply to comment #0)
> Created an attachment (id=421772) [details]

> - Two extra opcodes are used.  (Actually, I think LIR_add and LIR_addp can 
>   now be merged, in which case it's only one extra opcode.)  But it's hardly
>   the opcode explosion that was mentioned as a possibility several times in 
>   bug 495569.

TR still uses LIR_addp in a way that requires it non be subject to CSE and not merged with LIR_add; however, it can use x86 add or lea interchangeably.  It is used to compute interior pointers to objects in such a way that the GC always can still see the original pointer during a conservative stack scan.  (tenuous, I admit).

Someday we'll want to add LIR_addjov (etc), which should fit nicely in the new scheme.
* In ExprFilter::insGuardXov() it might be more advantageous to use int64 math, or to detect overflow w/out using larger data types.  I don't imagine this code is very hot but on softfloat or slow-fpu devices, it might matter.

* It doesn't look like mul(x,1) => x is handled, I remember someone saying recently it's a commonly generated expression.

* SanityFilter doesn't have new checks for the new instructions, but it probably doesn't matter if the new and improved ValidateWriter is coming down the pipe soon.
Attachment #421772 - Flags: review?(edwsmith) → review+
Attached patch NJ patch (obsolete) — Splinter Review
This fixes the other back-ends too.  Jacob, the mul_*.in lirasm tests required some major surgery, I think I've preserved their intent but I haven't yet tested on my ARM VM (I'll do that tomorrow).  Can you also check the multiplication logic in the ARM backend?  That was reworked extensively.

Other than those things, not that different to the first patch.
Attachment #421772 - Attachment is obsolete: true
Attachment #423481 - Flags: review?(edwsmith)
Attachment #423481 - Flags: review?(Jacob.Bramley)
Attached patch TM patchSplinter Review
Attachment #423482 - Flags: review?(gal)
Attached patch ARM fix (obsolete) — Splinter Review
I tried the lirasm tests on ARM.  With arch=6 they all passed.  With arch=5 mul_xxx.in exited at the wrong place.  I think the attached diff identifies a bug.  The diff in the generated code was like this:

@@ -90,7 +90,7 @@
   ...   rsbmi ip, ip, #0x0
   ...   mul r10, ip, ip
   ...   movs ip, ip, lsr #16
-  ...   cmp  r10, r10, asr #31
+  ...   cmp  ip, r10, asr #31
   ...   bne 0x...

The '+' line matches what the comment says should happen.

With the diff applied all tests pass on arch=5 and arch=6.  I also confirmed that the test failure occurs with arch=5 with the old tests, so that's some confirmation that the new tests are equivalent to the old ones.  Jacob, can you check this patch?  Thanks.
Attachment #423654 - Flags: review?(Jacob.Bramley)
Comment on attachment 423481 [details] [diff] [review]
NJ patch

Lovely! Much neater than the LIR_ov stuff.

>+    // Because MUL can't set the V flag, we use SMULL and CMP to set the Z flag
>+    // to detect overflow on multiply. Thus, if we have a LIR_mulxov, we must
>+    // be conditional on !Z, not V.
>+    ConditionCode cc = ( op == LIR_mulxov ? OppositeCond(EQ) : VS );

"NE" may be clearer than "OppositeCond(EQ)". They're equivalent, so it has no functional effect.

>+            case LIR_sub:
>+            case LIR_subxov:    asm_sub_imm(rr, ra, imm32, 1);  break;

This sets the condition flags after LIR_sub even when it doesn't need to. As far as I'm aware, we should set them for every xov case but not in any other case. (The fourth argument should be '1' to set them, and defaults to '0'.)

The same applies to the register-based versions (SUBs etc) below that block.

This code works, but may be inefficient as adding the S flag when it isn't necessarily can create false dependencies and inhibit things like dual-issue possibilities.

This is probably best:
+            case LIR_iaddp:
+            case LIR_add:       asm_add_imm(rr, ra, imm32);     break;
+            case LIR_addxov:    asm_add_imm(rr, ra, imm32, 1);  break;
+            case LIR_sub:       asm_sub_imm(rr, ra, imm32);     break;
+            case LIR_subxov:    asm_sub_imm(rr, ra, imm32, 1);  break;

...

Also, for MUL, we don't care about overflow in the non-xov case, so a simple MUL implementation will suffice (and be an improvement on what we have before). This will be faster than SMULL! It's non-trivial (as SMULL is) because you have to deal with ARMv5's strange rX=rX*rY problem. I can write a patch if you like. (I don't have time today, and probably not tomorrow, but I could look later in the week.)

----

The rest looks sound. I can't comment on the other architectures as I know little about them.
Attachment #423481 - Flags: review?(Jacob.Bramley) → review+
Attachment #423654 - Flags: review?(Jacob.Bramley) → review+
Comment on attachment 423654 [details] [diff] [review]
ARM fix


>-                    ALUr_shi(AL, cmp, 1, IP, rr, rr, ASR_imm, 31);
>+                    ALUr_shi(AL, cmp, 1, IP, IP, rr, ASR_imm, 31);

Err. Yeah. That's right, but we're setting some should-be-zero bits in the instruction to something that isn't zero (IP). Something came up related to this a while back, but I can't remember exactly what it was.

In any case, the error is not in your patch (and it doesn't make anything worse), so go for it.
(In reply to comment #10)
> >-                    ALUr_shi(AL, cmp, 1, IP, rr, rr, ASR_imm, 31);
> >+                    ALUr_shi(AL, cmp, 1, IP, IP, rr, ASR_imm, 31);
> 
> Err. Yeah. That's right, but we're setting some should-be-zero bits in the
> instruction to something that isn't zero (IP). Something came up related to
> this a while back, but I can't remember exactly what it was.

It's bug 536153 ("NativeARM.cpp: Incorrect use ALUr_shi to generate compare
insns").  There's a patch that fixes it and is r+'d there but unfortunately
did not get landed so far.  Patch is needed to cause NJ to generate ARM ARM
complaint code, which is important if you want to run it through Valgrind.
Attached patch NJ patch 2Splinter Review
(In reply to comment #5)
> * In ExprFilter::insGuardXov() it might be more advantageous to use int64 math,
> or to detect overflow w/out using larger data types.  I don't imagine this code
> is very hot but on softfloat or slow-fpu devices, it might matter.

I simply cut+paste (with modifications) the equivalent code from
ExprFilter::ins2().  I doubt it would be noticeable but if you really want this
can you put it in a follow-up bug?

> * It doesn't look like mul(x,1) => x is handled, I remember someone saying
> recently it's a commonly generated expression.

Yes, bug 539876 added it.  I added it to ExprFilter::insGuardXov().

> * SanityFilter doesn't have new checks for the new instructions, but it
> probably doesn't matter if the new and improved ValidateWriter is coming down
> the pipe soon.

I'd forgotten to add insGuardXov() to ValidateWriter, now done.  Also, I
found that ValidateWriter::insGuard() wasn't calling typeCheckArgs(), so I
added that.


(In reply to comment #9)
>
> >+    ConditionCode cc = ( op == LIR_mulxov ? OppositeCond(EQ) : VS );
>
> "NE" may be clearer than "OppositeCond(EQ)". They're equivalent, so it has no
> functional effect.

Done.

> >+            case LIR_sub:
> >+            case LIR_subxov:    asm_sub_imm(rr, ra, imm32, 1);  break;
>
> This sets the condition flags after LIR_sub even when it doesn't need to. As
> far as I'm aware, we should set them for every xov case but not in any other
> case. (The fourth argument should be '1' to set them, and defaults to '0'.)
>
> The same applies to the register-based versions (SUBs etc) below that block.

Fixed.  I got the add/addxov cases right, I just mucked up sub/subxov.

> Also, for MUL, we don't care about overflow in the non-xov case, so a simple
> MUL implementation will suffice (and be an improvement on what we have before).
> This will be faster than SMULL! It's non-trivial (as SMULL is) because you have
> to deal with ARMv5's strange rX=rX*rY problem. I can write a patch if you like.
> (I don't have time today, and probably not tomorrow, but I could look later in
> the week.)

I filed it as a follow-up, bug 542629.


(In reply to comment #11)
> (In reply to comment #10)
> > >-                    ALUr_shi(AL, cmp, 1, IP, rr, rr, ASR_imm, 31);
> > >+                    ALUr_shi(AL, cmp, 1, IP, IP, rr, ASR_imm, 31);
> >
> > Err. Yeah. That's right, but we're setting some should-be-zero bits in the
> > instruction to something that isn't zero (IP). Something came up related to
> > this a while back, but I can't remember exactly what it was.
>
> It's bug 536153 ("NativeARM.cpp: Incorrect use ALUr_shi to generate compare
> insns").  There's a patch that fixes it and is r+'d there but unfortunately
> did not get landed so far.

I just landed that patch.  I've reapplied my patch, which is still needed
for the arch=5 fix.


Thanks for all the comments!
Attachment #423481 - Attachment is obsolete: true
Attachment #423918 - Flags: review?(edwsmith)
Attachment #423481 - Flags: review?(edwsmith)
Attachment #423918 - Attachment description: patch → NJ patch 2
Attachment #423654 - Attachment is obsolete: true
When is this likely to land in NJ? TR would like to see it as it could reduce the complexity of LIR_mul implementation for ARM (as we don't use LIR_ov yet and so the extra code for overflow detection is wasted for us).
(In reply to comment #13)
> When is this likely to land in NJ? TR would like to see it as it could reduce
> the complexity of LIR_mul implementation for ARM (as we don't use LIR_ov yet
> and so the extra code for overflow detection is wasted for us).

As soon as Andreas and Ed do their reviews!
Comment on attachment 423482 [details] [diff] [review]
TM patch

>-        if (!result->isconst() && result->isop(v) && (!IsOverflowSafe(v, d0) || !IsOverflowSafe(v, d1))) {
>+        if (!IsOverflowSafe(v, d0) || !IsOverflowSafe(v, d1)) {

Why is the const check no longer needed?
It's not that it's not needed, rather it can't happen -- 'result' is now computed inside that if-then-else.

In other words, the old code inserted the arith op and then maybe inserted a guard.  Now we insert an arith-and-guard op which may be converted to a const.  Make sense?  It's a bit subtle, before I land this I'll do some checking to make sure that the generated native code is unchanged.
The proposed new instructions allow only for an exit on overflow, not an arbitrary branch.  I'm working on a patch for TR to inline arithmetic fast-path code, and have been using LIR_ov.  How difficult would it be to provide addjov, subjov, and muljov?  It seems that it ought to be straightforward, but I've not looked into the encoding of the LIR instructions and whether there is a problem encoding a value-returning branch.  I'd also be interested in the argument against a value-producing xov, which would also admit value-producing branches as in https://bugzilla.mozilla.org/show_bug.cgi?id=495569#c68, though I don't have any strong feelings against combining these with the operators.
I imagine LIR_addjov et al wouldn't be hard -- a value-returning branch isn't so different to a value-returning guard -- but I haven't tried so I can't be sure.

The LIR_xov idea is no good.  Here's an example:

  10 add  5, 6
  11 xov 10

The 'xov' still implicitly depends on the 'add' through the machine's condition codes.  The whole point of this bug is to get rid of that implicit dependence because it's error-prone (eg. bug 538484, bug 543161).
(In reply to comment #17)
> How difficult would it be to provide addjov, subjov, and muljov?

This kind of thing is a good argument for separating the concepts
"primitive operation" (add, sub, mul, etc) and "binary operator
arithmetic node".  At the moment these are glommed together in LIR,
leading to this problem.

So .. in an ideal world, we'd have an enum describing primitive
operations, and then LIR would have a binary-operator (value) node
holding a member of this enum indicating what to do, and a
binary-operator-w-overflow-exit (statement) node also holding a member
of this enum.  This would allow any binary arithmetic operation to be
of either "vanilla" or "with-exit-check" flavour, without having to
essentially duplicate the ops (and their constant folding rules, and
wotnot) which is what's happening at the momemt.
This is pretty much what you'd have if LIR was organized like this:

LIR_add,sub,etc
LIR_xov(binaryop, GuardRec)
LIR_jov(binaryop, target)

with xov/jov being value-producing statement instructions that maybe also branch.  the assemblers for those would emit the vanilla binary instruction in the right place, just like the current jt/jf emitters also generate the condition.

side note: jt/jf(ov(add)) would be too awkward because jt/jf would then have to produce the value returned by the add, if the branch/exit wasn't taken.

(i still am okay with the current addxov/mulxov design;  i think we can tolerate 3 more J variants... six total is okay with me).
(In reply to comment #19)
> 
> So .. in an ideal world, we'd have an enum describing primitive
> operations, and then LIR would have a binary-operator (value) node
> holding a member of this enum indicating what to do, and a
> binary-operator-w-overflow-exit (statement) node also holding a member
> of this enum.  This would allow any binary arithmetic operation to be
> of either "vanilla" or "with-exit-check" flavour, without having to
> essentially duplicate the ops (and their constant folding rules, and
> wotnot) which is what's happening at the momemt.

The encoding of LIR instructions is quite compact.  A binary operation has three words:  the two operands, and another word holding the opcode (8 bits) plus reservation info (24 bits).  Using a 2-level opcode system it would be difficult to avoid increasing the size of these operations on 32-bit machines.

Having (potentially) 9 opcodes for add/sub/mul variations instead of having 3 opcodes with 3 secondary opcodes isn't our most pressing problem at the moment.
I think it's important to write down, somewhere, whether the value produced by addxov/subxov/mulxov is valid if the exit is taken.  Doesn't matter to me which way it is specified, though.  it'd be clean if it is valid on either branch but I dont know if that introduces unwanted complexity.
Comment on attachment 423918 [details] [diff] [review]
NJ patch 2

* (optional followon bug) for clarity's sake, the fourth argument to asm_add_imm() in NativeARM.cpp should be required (1 or 0).
* i'm not very familiar with Sparc, so consider this a rubber stamp there.
* in NativeX64.cpp, we now have the opportunity to use LEA for add/sub immediate w/out worrying about whether the conditions are needed.  (the comment for add R,R also applies to add/sub R,IMM)
* i think we need a 64bit offset check in asm_branch_xov in NativeX64.cpp
Attachment #423918 - Flags: review?(edwsmith) → review+
Comment on attachment 423482 [details] [diff] [review]
TM patch

Make sure you double and tripple check the ->isconst() removal. It makes sense that its no longer needed since we now insert a combo instruction that contains the guard and the guard will be eliminated if the expr becomes constant. This code used to generate bogus LIR_ov instructions pointing to loads or constants. Sorry for the delay.
Attachment #423482 - Flags: review?(gal) → review+
(In reply to comment #22)
> I think it's important to write down, somewhere, whether the value produced by
> addxov/subxov/mulxov is valid if the exit is taken.

Isn't it a moot question?  If you've exited then the result isn't usable, AIUI.
#25: Traces attach to guards. We currently don't refer to those values directly, but we could (and we used to with tree compilation).
(In reply to comment #25)
> (In reply to comment #22)
> > I think it's important to write down, somewhere, whether the value produced by
> > addxov/subxov/mulxov is valid if the exit is taken.
> 
> Isn't it a moot question?  If you've exited then the result isn't usable, AIUI.

It is fine if you want to define it that way, but I don't think it's moot.  When the exit path is traced, it might be desireable to still use the overflowed value to construct the correct result.  the addxov instruction dominates the trace path and the exit path, so there's nothing wrong with defining it to have a result, at least with with add/sub where the overflow is 1 bit.  Maybe not with mul (too many bits of information are lost).

I'm also thinking ahead to 'j' variants that don't exit, just branch.  Of course they can be specified independently from the 'x' opcodes.
I guess the overflowed value can be used on exit, assuming you know which register holds it.  I don't understand how that would work, though.

Anyway, the LIR has changed but the generated code hasn't.  Eg. when we had this:

      add1 = add ld1, JSVAL_INT
  00f7225fc1   add ebx,1
      ov1 = ov add1       # codegen'd with the xt
      xt1: xt ov1 -> pc=0x9b0dd1a imacpc=(nil) sp+0 rp+0 (GuardID=026)
  00f7225fc4   jo    0xf7215fa8

we now have this:

     addxov1 = addxov ld1, JSVAL_INT -> pc=0x8accd1a imacpc=(nil) sp+0 rp+0 (GuardID=026)
  00f722bfc1   add ebx,1
  00f722bfc4   jo    0xf721bfa8

In the first case I guess if you know that 'add1' is in ebx you can use it off-trace, and likewise for 'addxov1'/ebx in the second case.

The description for addxov in LIRopcode.tbl is currently this:

  // 32-bit integer addition; exit if overflow occurred

If I change it to this:

  // 32-bit integer addition; exit if overflow occurred; result is valid on exit

would that make you happy?
Attachment #427707 - Attachment is obsolete: true
Attachment #427708 - Attachment is patch: true
Attachment #427708 - Attachment mime type: application/octet-stream → text/plain
(In reply to comment #28)

> would that make you happy?

Yep.  i'm just looking for LIR semantics to be specific one way or the other.
http://hg.mozilla.org/mozilla-central/rev/7a62d623d36a
Status: ASSIGNED → RESOLVED
Closed: 10 years ago
Resolution: --- → FIXED
Whiteboard: fixed-in-nanojit, fixed-in-tracemonkey → fixed-in-nanojit, fixed-in-tracemonkey, fixed-in-tamarin
Product: Core → Core Graveyard
You need to log in before you can comment on or make changes to this bug.