Closed Bug 458279 Opened 16 years ago Closed 14 years ago

MOPS opcodes need to be optimized on CodegenLIR/nanojit

Categories

(Tamarin Graveyard :: Virtual Machine, defect, P4)

defect

Tracking

(Not tracked)

RESOLVED FIXED
Q3 11 - Serrano

People

(Reporter: stejohns, Assigned: stejohns)

References

Details

Attachments

(2 files, 23 obsolete files)

27.16 KB, patch
17.42 KB, patch
The fast memory-access opcodes added in https://bugzilla.mozilla.org/show_bug.cgi?id=458277 aren't yet supported in nanojit mode (they fall back to interpreter-only mode).
Correction: as of the latest patch in http://hg.mozilla.org/users/stejohns_adobe.com/tamarin-mops/ they simply generate method calls; however, the x86-32 specialization done in MIR produces substantial performance gains and should be ported to LIR for all architectures.
Depends on: 458277
Flags: flashplayer-triage+
Flags: flashplayer-qrb?
Blocks: 477785
Summary: MOPS opcodes need to be implemeted for CodegenLIR/nanojit → MOPS opcodes need to be optimized on CodegenLIR/nanojit
Depends on: 479456
No longer blocks: 477785
Flags: flashplayer-qrb? → flashplayer-qrb+
Priority: -- → P3
Target Milestone: --- → flash10.x
Priority: P3 → P4
Attached patch Rough patch #1 (obsolete) — Splinter Review
Rough initial patch, totally incomplete, but to get the ball rolling...

This patch implements li8, li16, li32, lf64 (but not lf32) using nanojit. It passes the acceptance tests on MacTel32 and MacTel64, but has a number of serious deficiencies:

-- there's no nanojit op to store 8 or 16 bit values, so these are handled by callouts. (see bug 527083)

-- there's no nanojit op to coerce between 32-bit and 64-bit floating point values, so that's handled by callout. (see bug 527246)

-- it's only implemented for AVMPLUS_UNALIGNED_ACCESS. The opcodes in question require that unaligned loads/stores work correctly, so architectures that don't allow such operations will have to do byte-at-a-time operations. (Probably via callouts.)

-- it doesn't handle byte-swapping. The opcodes in question require that values in memory are always stored little-endian, regardless of the CPU architecture (thus big-endian machines must byte-swap on loads and stores).

-- the MIR version had an optimization that could collapse adjacent range checks in many cases; this one doesn't, it emits a range check for every load/store.

-- the MIR version had an optimization that allowed it to emit actual addresses into the jitted code, which were then patched as necessary later on. this one doesn't, requiring extra reloads from domain->globalMemory every time.

-- the MIR version was able to combine load-unsigned+sign-extend operations into a single operation on the backend; this one doesn't, as nanojit has no load-signed primitives (see bug 527085)

Obviously, then, this patch is offered more as proof-of-concept, so consider the review to be one of whether the general approach seems sound or not. Fixes to most (if not all) of the above issues will be required prior to landing.
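
For the record, here's roughly what one of those callouts boils down to. This is just a sketch (the names, the signature, and keeping the range check inside the helper are illustrative, not the actual Tamarin code), assuming a little-endian host, memSize >= 4 and memSize < 2^31:

#include <stdint.h>
#include <string.h>

void mop_rangeCheckFailed_sketch();   // hypothetical; throws and does not return

// Hypothetical 32-bit MOPS load callout: range-check, then an unaligned
// little-endian load from the domain's global memory.
int32_t mop_li32_sketch(uint8_t* memBase, uint32_t memSize, int32_t addr)
{
    // One unsigned compare covers both failure cases: a negative addr wraps
    // to a huge unsigned value (memSize < 2^31), and addr > memSize - 4 is
    // an out-of-bounds read.
    if (uint32_t(addr) > memSize - 4)
        mop_rangeCheckFailed_sketch();
    int32_t value;
    memcpy(&value, memBase + addr, 4);   // memcpy tolerates unaligned addresses
    return value;
}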
Assignee: nobody → stejohns
Attachment #410995 - Flags: review?(edwsmith)
see also bug 527258 about whether unaligned-access is correct for ARM or not.
Attached patch Rough patch #2 (obsolete) — Splinter Review
Cleaned-up version of Patch #1; still not ready for review, but closer to useful:

This patch has li8, li16, li32, lf64 (but not lf32) and si32, sf64 (but not si8, si16, sf32) implemented directly in nanojit; other operations use callout functions. It passes the acceptance tests on MacTel32 and MacTel64, but has a number of serious deficiencies:

-- we use LIR_ldcb and LIR_ldcs for 8/16 bit loads, but this is arguably wrong; we really need volatile loads to match expected behavior (see bug 527085). Acceptance tests pass, but probably more extensive usage would fail in subtle ways.

-- there's no nanojit op to store 8 or 16 bit values, so these are handled by
callouts. (see bug 527083)

-- there's no nanojit op to coerce between 32-bit and 64-bit floating point
values, so that's handled by callout. (see bug 527246)

-- AVMPLUS_UNALIGNED_ACCESS and AVMPLUS_BIG_ENDIAN systems always use callouts vs inlined code. This may or may not be acceptable for a final version; I suspect that at least some ops (eg li16/si16) are probably worth inlining on such systems.

-- the MIR version had an optimization that could collapse adjacent range
checks in many cases; this one doesn't, it emits a range check for every
load/store.

-- the MIR version had an optimization that allowed it to emit actual addresses
into the jitted code, which were then patched as necessary later on. this one
doesn't, requiring extra reloads from domain->globalMemory every time.

-- the MIR version was able to combine load-unsigned+sign-extend operations
into a single operation on the backend; this one doesn't, as nanojit has no
load-signed primitives (see bug 527085)

So this patch is still proof-of-concept, but again, worthy of cursory examination...
Attachment #410995 - Attachment is obsolete: true
Attachment #411001 - Flags: review?(edwsmith)
Attachment #410995 - Flags: review?(edwsmith)
Depends on: 527246, 527083, 527085, 527258
No longer depends on: 458277, 479456
Comment on attachment 411001 [details] [diff] [review]
Rough patch #2

initial comments (review removed, too early for + or -)

* do we really need duplicate range-check-failed functions?  the jit code could call MethodEnv::mopRangeCheckFailed().  or, everyone could use mop_rangeCheckFailed from instr.cpp regardless of VMCFG_JIT_MOPS.  (instr.cpp isn't meant to be jit specific).

* suggest breaking patch up in a couple layers:
  * moving code around and into instr.cpp/h
  * new load/store opcodes (pure nanojit changes)
  * add feature jit-mops
Attachment #411001 - Flags: review?(edwsmith)
(In reply to comment #5)
> (From update of attachment 411001 [details] [diff] [review])
> initial comments (review removed, too early for + or -)
> 
> * do we really need duplicate range-check-failed functions?  the jit code could
> call MethodEnv::mopRangeCheckFailed().  or, everyone could use
> mop_rangeCheckFailed from instr.cpp regardless of VMCFG_JIT_MOPS.  (instr.cpp
> isn't meant to be jit specific).

yeah, the latter is what I'm doing in my working code; all the MethodEnv calls are going to go away.
 
> * suggest breaking patch up in a couple layers:
>   * moving code around and into instr.cpp/h
>   * new load/store opcodes (pure nanojit changes)
>   * add feature jit-mops

Good call. At this point I want to get something pretty close to landable, then I'll split it into reviewable chunks.
Attached patch Clean up Domain's mops code (obsolete) — Splinter Review
Re-doing this as smaller patches for review...

First is drive-by cleanup in Domain: privatize vars, convert to C99 types, and (most importantly) remove all the memory-protection-permission code. The new jit code probably won't emit the constants in code memory (at least initially), and if we do end up needing it, we need to use the code in nanojit anyway.
Attachment #411001 - Attachment is obsolete: true
Attachment #411732 - Flags: review?(edwsmith)
Attached patch Correct Domain patch (obsolete) — Splinter Review
Same comments as before, just attached wrong version of patch
Attachment #411732 - Attachment is obsolete: true
Attachment #411733 - Flags: review?(edwsmith)
Attachment #411732 - Flags: review?(edwsmith)
Attached patch Refactor mops callouts (obsolete) — Splinter Review
Next patch (additive to previous):
-- move all the mops load/store callouts from methodenv into instr.cpp and update LIR and Interpreter to use them
-- the new load/store callouts don't do any range checking, they assume the caller does it. 
-- added code to LIR to emit range-checks prior to the load/store callouts. this code is naive relative to MIR but will be improved in subsequent patches, to collapse redundant checks. (in the short-term this means that LIR code generation for the MOPS code will be fatter, as the range checks used to be inside the call itself.)
-- we allocate a bit of codemgr memory for each method that uses MOPS opcodes to stash a copy of the memory base and size. this requires an extra load (relative to MIR, which inlined the address/size into the instr stream itself) but greatly simplifies the code; I want to actually measure the perf before trying that approach, as it will be tricky to do properly in LIR
Attachment #411746 - Flags: review?(edwsmith)
only 32-bit ints and 64-bit doubles are active now, since nj lacks the other primitives we need.

-- I was unsure if I should have AVMFEATURE_JIT_MOPS require UNALIGNED and LITTLE_ENDIAN at the feature level, or implementation level. currently opted for the latter, mainly because UNALIGNED isn't really part of the feature system, so it would make configuring a little awkward. opinions?
-- the idea is that when nj implements all the primitives, this will morph to either-or at compile time; this is a temporary compromise.
Attachment #411757 - Flags: review?(edwsmith)
Optimization: when doing a load/store that is a constant offset from a previous load/store, emit an instruction with the right displacement rather than doing math at runtime. Also modified the mops.abs testcase to exercise this. (It's a common idiom in Alchemy-generated code.)
Attachment #411772 - Flags: review?(edwsmith)
Attachment #411733 - Flags: review?(edwsmith) → review+
CodegenLIR.cpp:4901  are we sure this comment is correct?  since size is uint32_t, it could be >2GB, and a negative int could fall in range.  but why would "addr" (offset from globalMemoryBase) be signed in the first place; if everything is unsigned then there's no need for the trickery or at least no need for the elaborate explanation?

if we support >2GB memory blocks (on a 64bit cpu, say) then this code needs to be conditional and when size is large, we need to do the whole thing with 64bit math (then the 64bit negative mopAddr value will wrap around and still be larger than the large 32 bit size constant).

line 4912: if mopsMemoryBase is not pointing to a GC object, then we can use LIR_piadd here instead of addp.  LIR_piadd is a pure add operator whereas LIR_addp is restricted from certain optimizations that can leave dangling interior pointers.

line 4888: this would all be somewhat cleaner if the mops patching code in Domain were moved into CodegenLIR, maybe with the patching lists living on CodeMgr; that way the patching machinery in Domain goes away, and when the jit changes how patching is done it ought not to cause churn in class Domain.
Comment on attachment 411746 [details] [diff] [review]
Refactor mops callouts

R+ as is (no bugs found); if the recommended cleanups are implemented, only re-review if you feel it changed enough to require a second look.
Attachment #411746 - Flags: review?(edwsmith) → review+
(In reply to comment #12)
> if we support >2GB memory blocks (on a 64bit cpu, say) 

The largest memory block supported by MMgc is 2^32-1 bytes, in practice this means 2^32-8 bytes, and that's including headers and so on, so the maximum user payload is 2^32-4096, probably.  This is true on 64-bit systems as well.
Comment on attachment 411757 [details] [diff] [review]
Use existing nanojit primitives instead of callouts

I'm not convinced the extra fields in MopsInfo and the branchy ifdef-y logic when emitting loads and stores are worth the trouble.
 
why not split out the cases that we can support with inline loads/stores (right now, int32 and double), and factor so that shared code is in a helper subroutine?
Attachment #411757 - Flags: review?(edwsmith) → review-
Comment on attachment 411772 [details] [diff] [review]
Optimize loads with constant offsets

whitespace is getting messed up
(In reply to comment #12)
> CodegenLIR.cpp:4901  are we sure this comment is correct?  since size is
> uint32_t, it could be >2GB, and a negative int could fall in range.  but why
> would "addr" (offset from globalMemoryBase) be signed in the first place; if
> everything is unsigned then there's no need for the trickery or at least no
> need for the elaborate explanation?

I talked about this with Scott Peterson offline... in the existing MIR version, the offset is verified as INT_TYPE rather than UINT_TYPE, and he was of the opinion that offsets >2GB probably wouldn't work. So I'm going to revise the patch to use int32 uniformly for the offset, and check for negative values. (Yes, this sucks, but we're likely stuck with it now.)

> if we support >2GB memory blocks (on a 64bit cpu, say) then this code needs to
> be conditional and when size is large, we need to do the whole thing with 64bit
> math (then the 64bit negative mopAddr value will wrap around and still be
> larger than the large 32 bit size constant).

MMGC doesn't support blocks larger than 4GB even on 64bit, but the signedness of offset moots even this.


> line 4912: if mopsMemoryBase is not pointing to a GC object, then we can use
> LIR_piadd here instead of addp.  LIR_piadd is a pure add operator whereas
> LIR_addp is restricted from certian optimizations that can leave dangling
> interior pointers.

mopsMemoryBase is pointing to either:
-- storage that lives inside ByteArray, which is allocated via mmfx_new
-- Domain::globalMemoryScratch, which is embedded inside class Domain

we could easily change globalMemoryScratch to also be mmfx_new allocated, then change this op...

> line 4888: this would all be somewhat cleaner if the mops patching code in
> Domain were moved into CodegenLIR, maybe with the patching lists living on
> CodeMgr goes away and so do the necessary patching lists.  and this way when
> the jit changes and we change how patching is done, it ought to not cause churn
> in class Domain.

Yeah. Probably so. Let me look at doing that.
(In reply to comment #15)
> (From update of attachment 411757 [details] [diff] [review])
> I'm not convinced the extra fields in MopsInfo and the branchy ifdef-y logic
> when emitting loads and stores is worth the trouble.

yeah, it's ugly. I've already re-thought it in a private branch, I'll clean up and resubmit.
Comment on attachment 411733 [details] [diff] [review]
Correct Domain patch

pushed to redux as changeset:   3087:68b17ac5150f
Attachment #411733 - Attachment is obsolete: true
Comment on attachment 411746 [details] [diff] [review]
Refactor mops callouts

pushed to redux as changeset:   3088:ffcc8860bac9
Attachment #411746 - Attachment is obsolete: true
(In reply to comment #12)
> line 4888: this would all be somewhat cleaner if the mops patching code in
> Domain were moved into CodegenLIR, maybe with the patching lists living on
> CodeMgr goes away and so do the necessary patching lists.  and this way when
> the jit changes and we change how patching is done, it ought to not cause churn
> in class Domain.

So, I looked into doing this, and it's an ugly rathole... Domain and ByteArray are tightly coupled in an unpleasant way... and Flash/AIR are using a slightly different variant of ByteArray (vs the one in tamarin/shell). I'm going to take some baby steps to decouple them a little bit, but we may not want to move this into CodeMgr/JIT/etc just yet until we untangle it enough to fully understand the implications...
More Domain cleanup to try to decouple ByteArray from Domain, at least a little bit, along with drive-by cleanup of privatizing everything. The fundamental issue is that Domain needs to work with two different "ByteArray" objects, with completely different AS3 and C++ inheritance hierarchies... we currently have everything just right so that they both work, but that's not sustainable. Long-term, we really need to unify them (in Tamarin), but that's far too ticklish for the short-term schedule; thus the introduction of ScriptObject::getGlobalMemoryProvider as the communications channel. Yes, it's evil, but we have precedent (and it's also far more robust than the old isMemoryObject method was).

Also changed the scratch memory to use mmfx_new memory to pave the way for the jit addition optimization mentioned in an earlier comment.

Lot of churn here (sorry bout that) but it's long overdue IMHO.
Attachment #411757 - Attachment is obsolete: true
Attachment #411772 - Attachment is obsolete: true
Attachment #412116 - Flags: review?(edwsmith)
Attachment #411772 - Flags: review?(edwsmith)
Attachment #412116 - Flags: review?(edwsmith) → review+
Comment on attachment 412116 [details] [diff] [review]
More Domain cleanup, ByteArray decoupling

pushed as changeset:   3186:9882a0bbd3fd
Attachment #412116 - Attachment is obsolete: true
nanojit only defines NJ_EXPANDED_LOADSTORE_SUPPORTED for i386 currently. MOPS requires little-endian unaligned access, so only platforms that can support this will ever be supported directly, regardless of nanojit support (so x64 and later ARM chips will eventually be enabled; PPC/Sparc probably not)
Attachment #415929 - Flags: review?(edwsmith)
Comment on attachment 415929 [details] [diff] [review]
Have LIR use NJ_EXPANDED_LOADSTORE_SUPPORTED ops when appropriate

> (so x64 and later ARM chips will eventually be enabled, PPC/Sparc probably not)

support for unaligned access is the hard requirement.  we can support big endian cpu's if they can do unaligned memory access and if they have a cheap byte-swap instruction.  (ppc has one).

* NJ_EXPANDED_LOADSTORE_SUPPORTED is not exactly the right flag to use for loads we support on all platforms (32bit ints, 64bit floats), and at the same time, we need something that ensures we only use LIR ops for MOPS ops when the CPU has the proper semantics.  (little-endian, support-for-unaligned access)

If it is feasible to factor the range-check optimizations into a LirWriter filter then that would be preferable.  Plus, comments explaining the transformations being performed in mopAddrToRangeCheckedRealAddrAndDisp() would help.  I am having a hard time following the code.  (e.g. before/after examples, notes on the context for the optimization (basic blocks only?) )

(still studying the code looking for bugs, but that's enough feedback for now I hope).
Attachment #415929 - Flags: review?(edwsmith) → review-
(In reply to comment #25)
> If it is feasible to factor the range-check optimizations into a LirWriter
> filter then that would be preferrable.

Nevermind about LirWriter, I'm looking at the applied patch in context now and it's not bad as-is.  I can still imagine a filter that generalizes the optimization we're doing but it's probably not worth making a filter just for what we have (one callout in 4 spots is not a big deal).
The sign-extend folding could cause loads to move relative to stores:

                if (opcode == OP_sxi8 && val->opcode() == LIR_ldzb)
                {
                    // optimize sxi8(li8) -> LIR_ldsb
                    val->initLInsLd(LIR_ldsb, val->oprnd1(), val->disp());
                    break;
                }

here's an ABC sketch.  assume address "A" has an interesting value "x":

   li8  A      ( x )
   pushint 0   ( x 0 )
   si8 A       ( x )    // now memory[A] == 0
   sxi8        ( x' )

here is the LIR without the optimization:

   ldzb1 = ldzb A
   stb 0, A         // now memory[A] == 0
   result = sxi8 ldzb1  // do the sign extend of the loaded value

here is LIR with the optimization:

   ldzb1 = ldzb A    // dead code
   stb 0, A          // memory[A] == 0
   result = ldsb A   // loads 0

The fix is something that ensures there are no intervening stores.  the simplest fix is to require the OP_sxi8 be immediately after the load in question.  you can't detect that just by looking at the operand, something needs to look at the sequence of instructions, peephole-style.
(In reply to comment #25)
> support for unaligned access is the hard requirement.  we can support big
> endian cpu's if they can do unaligned memory access and if they have a cheap
> byte-swap instruction.  (ppc has one).

yeah, but it doesn't support unaligned access, does it? (not sure)
 
> * NJ_EXPANDED_LOADSTORE_SUPPORTED is not exactly the right flag to use for
> loads we support on all platforms (32bit ints, 64bit floats), and at the same
> time, we need something that ensures we only use LIR ops for MOPS ops when the
> CPU has the proper semantics.  (little-endian, support-for-unaligned access)

well, yeah, that's why it's combined with checks for LITTLE_ENDIAN and UNALIGNED_ACCESS

> Plus, comments explaining the
> transformations being performed in 
> mopAddrToRangeCheckedRealAddrAndDisp() would
> help.  I am having a hard time following the code.  (e.g. before/after
> examples, notes on the context for the optimziation (basic blocks only?) )

I'll see what I can do.

(In reply to comment #27)
> The sign-extend folding could cause loads to move relative to stores:

Nice catch. I'll resubmit.
(In reply to comment #27)
> The fix is something that ensures there are no intervening stores.  the
> simplest fix is to require the OP_sxi8 be immediately after the load in
> question.  you can't detect that just by looking at the operand, something
> needs to look at the sequence of instructions, peephole-style.

Gah. Peephole-style optimizations are perversely painful to do in nanojit... there's no generic "prevIns" call (we'd need to link LirBuffer in both directions to do this generally), and both LirWriter and LirReader are only well-suited to single-insn-at-a-time work. I can probably find a way to work around it in this case, but this isn't the only case I've seen... we should think about whether there's a better way of doing instruction-sequence optimizations.
(In reply to comment #28)
> (In reply to comment #25)
> > support for unaligned access is the hard requirement.  we can support big
> > endian cpu's if they can do unaligned memory access and if they have a cheap
> > byte-swap instruction.  (ppc has one).
> 
> yeah, but it doesn't support unaligned access, does it? (not sure)

Steven might be thinking of the load-word-with-reversed-bytes instruction (lwbrx); I assume it, too, needs to be aligned, but I haven't checked.  Warren [1] notes that you can implement in-register byte reversal on PPC with three instructions (a rotate followed by two rotate-left-with-mask-insert, details absent).

Sparc does not have a byte-swap instruction from what I can tell (only checked the v9 instruction set).  The best I can think of is 

(x >> 24) | (x << 24) | ((x & 0x00FF0000) >> 8) | ((x << 8) & 0x00FF0000)

which is 10 instructions and two temp registers on SPARC (and untested!)  On MIPS one should be able to use 0x0000FF00 analogously and avoid one temp register and the SETHI instruction (since immediates are 16 bits IIRC).

[1] Henry S Warren, "Hacker's Delight", Addison Wesley 2003.
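
For reference, the shift/mask sequence above as a plain helper (a sketch only; a real port would of course prefer the CPU's byte-swap instruction where one exists):

#include <stdint.h>

// Portable 32-bit byte swap using only shifts, masks and ors -- the fallback
// for CPUs with no byte-swap instruction.
static inline uint32_t bswap32(uint32_t x)
{
    return (x >> 24) | (x << 24) |
           ((x & 0x00FF0000u) >> 8) |
           ((x << 8) & 0x00FF0000u);
}
// e.g. bswap32(0x11223344) == 0x44332211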
(In reply to comment #29)
> (In reply to comment #27)
> > The fix is something that ensures there are no intervening stores.  the
> > simplest fix is to require the OP_sxi8 be immediately after the load in
> > question.  you can't detect that just by looking at the operand, something
> > needs to look at the sequence of instructions, peephole-style.
> 
> Gah. peephole-style optimizations are perversely painful to do in nanojit...
> there's no generic "prevIns" call (we'd need to link LirBuffer in both
> directions to do this generally), and both LirWriter and LirReader are only
> well-suited to single-insn-at-a-time work. I can probably find a way to work
> around it in this case but this isn't the only case I've seen... we should
> think if there's a better way of doing instruction-sequence optimizations.

It might be a lot easier to do the peephole analysis on the ABC instruction sequence.  The CodeWriter interface allows pipelined operations, it might be possible to introduce a peephole analyzer.  Since WordcodeEmitter includes a peephole optimizer, you might get ideas from that, although it is much more general.
(In reply to comment #31)
> It might be a lot easier to do the peephole analysis on the ABC instruction
> sequence.  The CodeWriter interface allows pipelined operations, it might be
> possible to introduce a peephole analyzer.  Since WordcodeEmitter includes a
> peephole optimizer, you might get ideas from that, although it is much more
> general.

Yeah, that definitely will be a simpler way to solve at least one of my problems (li8+sxi8 fusing), but I don't think it is amenable to some others (collapsing redundant range checks)... but let me ponder if there's a clever way to deal with that. 

In any event my complaint still stands: peephole opt at the LIR level would be useful but currently is painful.
Revised version of patch:
-- expand commenting in mopAddrToRangeCheckedRealAddrAndDisp(), and remove incorrect optimization we had before (for const-expr)
-- remove entirely the optimizations for sxi8 and sxi16 (these will be restored in a subsequent patch, probably at the ABC level)
Attachment #415929 - Attachment is obsolete: true
Attachment #416435 - Flags: review?(edwsmith)
Attached patch Specialize load-and-sign-extend (obsolete) — Splinter Review
(additive to patch 416435)

specialize li8+sxi8 and li16+sxi16 pairs in LIR and wordcode. This borrows the approach used by findpropglobal and findpropglobalstrict (ie, uses empty space in the ABC opcode range).
Attachment #416456 - Flags: review?(edwsmith)
Attached patch Use piadd instead of addp (obsolete) — Splinter Review
Per earlier comment: since mops memory is not a GCObject, we can use piadd instead of addp.
Attachment #416483 - Flags: review?(edwsmith)
Attached patch Smarten the range-check (obsolete) — Splinter Review
A little algebra and signed/unsigned trickery allows us to reduce all range checks to a single compare. This makes a dramatic speed difference, even without trying to collapse redundant checks within a single basic block.
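
In C terms, the emitted check amounts to something like this (sketch only; assumes mopsMemorySize < 2^31 and size a small positive compile-time constant, so the subtraction can't underflow):

#include <stdint.h>

// fail if (mopAddr < 0 || mopAddr + size > mopsMemorySize), as one compare:
// a negative mopAddr wraps to a huge unsigned value and fails automatically.
static inline bool mopsRangeCheckFails(int32_t mopAddr, uint32_t mopsMemorySize, uint32_t size)
{
    return uint32_t(mopAddr) > mopsMemorySize - size;
}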
Attachment #416495 - Flags: review?(edwsmith)
May seem counterintuitive, but the mopAddr can legitimately be negative, as it can be part of an earlier calculation (with a positive constant displacement) due to the optimization earlier in the function.
Attachment #416812 - Flags: review?(edwsmith)
Attachment #416483 - Flags: review?(edwsmith) → review+
Attachment #416812 - Flags: review?(edwsmith) → review+
Comment on attachment 416435 [details] [diff] [review]
 Have LIR use NJ_EXPANDED_LOADSTORE_SUPPORTED ops when appropriate, v2

* needs an assert (and explanation) somewhere prominent that the mops memory size is <2GB

* I had to refactor and comment the code using sumFitsInInt32 a bunch before I grokked what was happening.  the result is only slightly clearer; a comment at the end of each branch would help.... meh.

* ditto, in the code emitting the range check, a comment that shows the emitted code in C-like-syntax is extremely helpful.  e.g., for the block at CodegenLIR::4986:

// if (uint32(mopAddr) > uint32(mopsMemorySize - size)) goto rangeCheckFail_label


.. that said, we don't need another round of review, just as much clarity as you can stomach in a few key places.

what makes review hard is getting the context in your head for what the big picture is for these mops.  maybe a separate doc would help with that.  not critical-path for these patches.
Attachment #416435 - Flags: review?(edwsmith) → review+
* stray commented-out code in AvmCore.cpp?

* instr.h: needs comments to clarify that li8/16 are zero extended (or rename to liz8/16 for symmetry?)

* Verifier.cpp:1792 the peephole logic can hiccup if there is a branch target without a label in between OP_li* and OP_sx*, which is possible in the context of forwards jumps:
       li8 &x      ( -- x )
       ...         ( -- x bool )
N-1:   iftrue N+2  ( -- x )
N:     pop         ( -- )
N+1:   li8 &y      ( -- y )
N+2:   sxi8        ( -- sx(y) | sx(x) )

you can test for the presence of an implicit label using (blockStates ? blockStates->get(pc+1) != NULL : false)

(explicit labels are no problem, the peepholer will see the label instruction).

* need to review the same idea, but what if a try/catch range ends right in between the two instructions.  I think this is harmless since it doesn't imply a jump, but it would be smart to sketch an example and argue it's safe.  (or defer the analysis and only do the optimization for no-try-catch methods)... heck, it could be a nearly-nonexistent pattern in Alchemy output.

* nice to have, if feasible: encapsulate the peephole logic in a separate CodeWriter.
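
For the Verifier.cpp item, the check being asked for is roughly this shape (standalone sketch; OP_li8/OP_sxi8 and the branch-target test come from the comments above, everything else is made up for illustration):

enum AbcOp { OP_li8, OP_li16, OP_sxi8, OP_sxi16, OP_other };   // tiny subset, sketch only

// Fuse li8+sxi8 (or li16+sxi16) only when the sign-extend immediately follows
// the load and is not itself a branch target -- explicit label, or implicit
// target of a forward jump (the blockStates->get(pc+1) test).
static bool canFuseLoadSignExtend(AbcOp op, AbcOp next, bool nextIsBranchTarget)
{
    bool pairMatches = (op == OP_li8  && next == OP_sxi8) ||
                       (op == OP_li16 && next == OP_sxi16);
    return pairMatches && !nextIsBranchTarget;
}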
Attachment #416456 - Flags: review?(edwsmith) → review-
>   // we want to fail range-check if
>   //      (mopAddr + disp < 0 || mopAddr + disp + size > mopsMemorySize)
                                                        ^^^
I think the test should be >=

   (mopAddr + disp < 0 || mopAddr + disp + size >= mopsMemorySize)

then, let A=mopAddr, D=disp, S=size, M=mopsMemorySize:

   A + D < 0 || A + D + S >= M
   A < -D    || A >= M - D - S
  !(A >= -D  && A <  M - D - S)

apply rule: (x >= min && x < max) ==> (unsigned(x-min) < unsigned(max-min))

  !(unsigned(A - -D) <  unsigned(M-D-S - -D))
  !(unsigned(A +  D) <  unsigned(M-D-S +  D))
  !(unsigned(A +  D) <  unsigned(M - S))
    unsigned(A +  D) >= unsigned(M - S)

if you buy all this, then in the code, CodegenLIR::5029 should s/LIR_ugt/LIR_uge.

lastly, D and S  (i.e. disp and size) are constant.  is this valid or does int wraparound invalidate it:

   unsigned(A + D)     >= unsigned(M - S)
   unsigned(A + D - D) >= unsigned(M - S - D)
   unsigned(A)         >= unsigned(M - const(S+D))
Attachment #416495 - Flags: review?(edwsmith) → review-
(In reply to comment #40)
> >   // we want to fail range-check if
> >   //      (mopAddr + disp < 0 || mopAddr + disp + size > mopsMemorySize)
>                                                         ^^^
> I think the test should be >=

sigh, no it shouldn't.  code is correct as-is.  the final question above about reassociating the constants still applies, though.
Attachment #416495 - Flags: review- → review+
(In reply to comment #40)
> lastly, D and S  (i.e. disp and size) are constant.  is this valid or does int
> wraparound invalidate it:
> 
>    unsigned(A + D)     >= unsigned(M - S)
>    unsigned(A + D - D) >= unsigned(M - S - D)
>    unsigned(A)         >= unsigned(M - const(S+D))

IIRC wraparound invalidates it; I tried that and had failures in obscure cases (I can dig 'em up if you are curious)
no need, it's wrong by inspection if A==0
(In reply to comment #39)
> * stray commented-out code in AvmCore.cpp?

oops
 
> * instr.h: needs comments to clarify that li8/16 are zero extended (or rename
> to liz8/16 for symmetry?)

cool

> * Verifier.cpp:1792 the peephole logic can hiccup if there is a branch target

possible, yes, but as a pragmatic matter, Alchemy seems to always emit these two instructions in lockstep. I'll see if I can fix it, but testing will be problematic

> * need to review same idea, but what if a try/catch range ends right inbetween
> the two instructions. 

see above. (Note that missing these is harmless, just less efficient since we load and then explicitly sign extend)

> * nice to have, if feasable: encapsulate the peephole logic in a separate
> CodeWriter.

Feasible but IMHO overkill; I looked at doing so but for this simple case it's far less code to take the easy way out. If we start adding other peephole optimizations then yeah, but for this one I think simple is good.
With (some of) Edwin's suggestions done
Attachment #416456 - Attachment is obsolete: true
Attachment #416973 - Flags: review?(edwsmith)
(In reply to comment #41)
> sigh, no it shouldn't.  code is correct as-is.  the final question above about
> reassociating the constants still applies, though.

sadly, some exhaustive testing reveals flaws; consider

   mopAddr=-2147483648 disp=-8 size=8 mopsMemorySize=8

which yields false for the old test but true for the optimized test.
(In reply to comment #46)

It seems there are unlikely-but-possible edge cases the range checking doesn't handle... I'm going to back off to a less aggressive version for now, and research possible improvements for future patches
Comment on attachment 416495 [details] [diff] [review]
Smarten the range-check

Not convinced the optimization works for all numeric combinations. Will investigate and resubmit.
Attachment #416495 - Attachment is obsolete: true
Comment on attachment 416435 [details] [diff] [review]
 Have LIR use NJ_EXPANDED_LOADSTORE_SUPPORTED ops when appropriate, v2

pushed (with minor mods) as part of changeset:   3295:b1842cff6d9f
Attachment #416435 - Attachment is obsolete: true
Comment on attachment 416483 [details] [diff] [review]
Use piadd instead of addp

pushed as part of changeset:   3295:b1842cff6d9f
Attachment #416483 - Attachment is obsolete: true
Comment on attachment 416812 [details] [diff] [review]
Address calc should use i2p, not u2p

pushed as part of changeset:   3295:b1842cff6d9f
Attachment #416812 - Attachment is obsolete: true
(In reply to comment #46)
> sadly, some exhaustive testing reveals flaws

flawed algebra, alas: my patch asserts that

    (mopAddr + disp < 0) == (mopAddr < -disp)

but the LHS above is really computed with int32 wraparound (the add is taken mod 2^32), so the overflow/underflow cases make it not equivalent to the RHS.
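
A tiny standalone check of that non-equivalence, using the values from comment #46 (sketch; the unsigned casts just make the wraparound explicit and well-defined):

#include <stdint.h>
#include <stdio.h>

int main()
{
    int32_t mopAddr = INT32_MIN;   // -2147483648
    int32_t disp    = -8;

    // LHS as the emitted 32-bit code computes it: the add wraps mod 2^32.
    int32_t wrappedSum = (int32_t)((uint32_t)mopAddr + (uint32_t)disp);  // +2147483640
    int lhs = (wrappedSum < 0);    // 0
    int rhs = (mopAddr < -disp);   // 1: INT32_MIN < 8

    printf("lhs=%d rhs=%d\n", lhs, rhs);   // lhs=0 rhs=1, so the two sides disagree
    return 0;
}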
(In reply to comment #51)
> (From update of attachment 416812 [details] [diff] [review])
> pushed as part of changeset:   3295:b1842cff6d9f

This patch is failing to compile on windows64, and has currently closed the tamarin-redux tree:


core\CodegenLIR.cpp(2920) : error C2220: warning treated as error - no 'object' file generated
core\CodegenLIR.cpp(2920) : warning C4267: 'argument' : conversion from 'size_t' to 'int32_t', possible loss of data
core\CodegenLIR.cpp(2930) : warning C4267: 'argument' : conversion from 'size_t' to 'int32_t', possible loss of data
core\CodegenLIR.cpp(2938) : warning C4267: 'argument' : conversion from 'size_t' to 'int32_t', possible loss of data
(In reply to comment #53)

Disregard comment #53, I now see that Steven reverted the changes in 3295
(In reply to comment #39)

> * Verifier.cpp:1792 the peephole logic can hiccup if there is a branch target
> without a label inbetween OP_li* and OP_sx*, which is possible in the context
> of forwards jumps:
>        li8 &x      ( -- x )
>        ...         ( -- x bool )
> N:     iftrue N+2  ( -- x )
> N+1:   pop         ( -- )
> N+1:   li8 &y      ( -- y )
> N+2:   sxi8        ( -- sx(y) | sx(x) }

I could have been clearer about how this example would be broken:

here's the output after peephole optimization:

        li8 &x      ( -- x )
        ...         ( -- x bool )
 N-1:   iftrue N+2  ( -- x )
 N:     pop         ( -- )
 N+1:   lix8 &y     ( -- y )    /* also advances verify-time pc to N+3 */
 N+2:   ???         ( -- sx(y) | sx(x) )
 N+3:   ...

For each bytecode, the verifier checks whether the pc is an implicit label (see the if block after "bool unreachable = false").  Advancing the pc past this point means the label won't be processed at all, which surely can cause problems.  At the very least, the sxi8 instruction won't be emitted at all, since we skipped it, which means it will not execute on the path where the branch at N-1: iftrue is taken.

possible fix:  do not skip the sxi8.  in the JIT, if the input to sxi8 is lix8, then the sign-extend is a no-op and the load can be directly referenced.

in the test example, the JIT will have processed a new basic block starting at N+2, and will not ignore the sxi8 (since there are two control flow paths leading to it, thus two possible input values).

but in typical code, the lix8 dominates the sxi8, the input to sxi8 is known to be the load, and the jit can drop the redundant sxi8.
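
Roughly what that JIT-side handling could look like (a sketch, not the actual CodegenLIR code; the shift-based fallback and the helper name are assumptions):

#include "nanojit.h"
using namespace nanojit;

// When emitting OP_sxi8: if the value is already the result of a
// sign-extending byte load (the fused lix8 case), the extend is a no-op.
static LIns* emitSxi8Sketch(LirWriter* lirout, LIns* val)
{
    if (val->isop(LIR_ldsb))
        return val;                                  // load already sign-extended
    // otherwise sign-extend explicitly: (val << 24) >> 24, arithmetic shift
    return lirout->ins2i(LIR_rsh, lirout->ins2i(LIR_lsh, val, 24), 24);
}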
Attachment #416973 - Flags: review?(edwsmith) → review-
> possible fix:  do not skip the sxi8

nice thought. I'll cook up an example and verify that what we expect is
happening.
Revised to handle the implicit-label issue Ed pointed out. Also added testcase.
Attachment #416973 - Attachment is obsolete: true
Attachment #417167 - Flags: review?(edwsmith)
Attachment #417167 - Flags: review?(edwsmith) → review+
Depends on: 534613
Attached patch Further range-check improvements (obsolete) — Splinter Review
Further tweaks:
-- loading globalMemoryBase/Size via LIR_ldc is not right, since we need to reload if any user call could have changed the domain's memory. Added a new Filter to maintain this and reload as necessary.
-- track the range of active checks, so any that are provably redundant can be eliminated.
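
The redundant-check idea, as a minimal sketch (names are made up, not the actual filter's API; assumes all the accesses share the same mopAddr expression and constant displacements, and that calls/labels flush the state):

#include <stdint.h>
#include <algorithm>

struct MopsRangeCheckSketch
{
    bool    haveCheck;               // is there an active check for the current base expr?
    int32_t activeMin, activeMax;    // byte range [activeMin, activeMax) already proven in bounds

    MopsRangeCheckSketch() : haveCheck(false), activeMin(0), activeMax(0) {}

    // Called per mops access at constant displacement disp with the given size.
    // Returns true if a (widened) range check must be emitted.
    bool needCheck(int32_t disp, int32_t size)
    {
        if (haveCheck && disp >= activeMin && disp + size <= activeMax)
            return false;                        // covered by an earlier check
        activeMin = haveCheck ? std::min(activeMin, disp) : disp;
        activeMax = haveCheck ? std::max(activeMax, disp + size) : disp + size;
        haveCheck = true;
        return true;
    }

    // Any call, label, or change of base address invalidates the tracked range.
    void flush() { haveCheck = false; }
};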
Attachment #417803 - Flags: review?(edwsmith)
Comment on attachment 417167 [details] [diff] [review]
Specialize load-and-sign-extend, take 3

pushed as 3316	19da674be2ef
Attachment #417167 - Attachment is obsolete: true
two nits:

It's not necessary to override LirWriter methods that just call out->whatever, since that's what the base methods do in LirWriter.

the code that uses and updates mopsRangeCheckFilter.activeMin/Max would be cleaner if it were encapsulated in the filter and just called from CodegenLIR... in the long run it's much easier to think about filter behaviors when their internal state is well encapsulated.

but... can't find anything wrong with the logic, so as long as tests pass (etc), push at will.
Attachment #417803 - Flags: review?(edwsmith) → review+
I'll do some cleanup work before pushing.
Patch with Edwin's suggestions... since buildbot is down today I don't want to push it yet, so I might as well get a re-review :-)
Attachment #417803 - Attachment is obsolete: true
Attachment #418226 - Flags: review?(edwsmith)
Let's try that again... wrong patch uploaded, with incorrect range-checking code.
Attachment #418226 - Attachment is obsolete: true
Attachment #418233 - Flags: review?(edwsmith)
Attachment #418226 - Flags: review?(edwsmith)
Attachment #418233 - Flags: review?(edwsmith) → review+
Comment on attachment 418233 [details] [diff] [review]
418226: Further range-check improvements, v3

technically we could break content in the future by either raising the minimum mops memory size or lowering it.  I'd guess it's more likely to break by raising it.  Should we just set it to 1K and forget?

(no need to re-review for that tho)
Attached patch More range-check tweaks (obsolete) — Splinter Review
Take advantage of the fact that (a > b) == (a >= b+1) for unsigned ints (provided b < 0xffffffff) to avoid a subtraction
Attachment #418446 - Flags: review?(edwsmith)
Comment on attachment 418233 [details] [diff] [review]
418226: Further range-check improvements, v3

http://hg.mozilla.org/tamarin-redux/rev/eb98f48cef15
Attachment #418233 - Attachment is obsolete: true
(In reply to comment #24)
> (PPC/Sparc probably not)

this test program compiles and runs fine on a G5 10.5 machine:

#include <stdio.h>
#include <stdint.h>

int main(int argc, char **argv)
{
    char data[100] = { 1, 2, 3, 4, 5, 6, 7, 8, 9 };
    printf("data %p\n", data);
    int* intp = (int*)(data+1);
    short* shortp = (short*)(data+1);
    int64_t* int64p = (int64_t*)(data+1);
    double* doublep = (double*)(data+1);
    printf("unaligned int %d\n", *intp);
    printf("unaligned short %d\n", *shortp);
    printf("unaligned int64 %d\n", *int64p);
    printf("unaligned double %f\n", *doublep);
    return 0;
}

not sure if this goes back to G4 or G3 machines, but maybe we only need to do byte-swapping to support inlined mops on ppc.  lwbrx does it with one instruction.  if there are equivalents for other loads (float, short), then we may pass GO with no added instructions either.
The MOPS globalmemory handling is currently rooted in Domain, but really ought to be rooted in DomainEnv, per offline discussions. This corrects that. Note that this change reduces MOPS performance.
Attachment #419084 - Flags: review?(edwsmith)
This restructures internal fields in MethodEnv to allow for faster access to the now-DomainEnv-based globalmemory. With both these patches in place, we are back to ~95% or so of the performance we had before (but still not quite at the average performance level of the MIR-based MOPS code in FP10).
Attachment #419085 - Flags: review?(edwsmith)
Comment on attachment 419085 [details] [diff] [review]
Rearrange MethodEnv::Aux to improve MOPS performance

This patch works, but is a little conceptually icky (it's definitely infiltrating concerns pretty far down).

I think if we could do the op-hoisting as described by Edwin in https://bugzilla.mozilla.org/show_bug.cgi?id=533546#c9 we could avoid the need for this entirely... we could then ensure that DomainEnv is loaded once at the start of a MOPS-using method, and get essentially the same benefit with less cruft.
Attachment #418446 - Flags: review?(edwsmith) → review+
Attachment #419084 - Flags: review?(edwsmith) → review+
Comment on attachment 419084 [details] [diff] [review]
Move MOPS memory handling to DomainEnv

I'm going to hold off pushing this one pending further investigation -- we think there might be a better approach than the subsequent patch to make up the lost performance.
Comment on attachment 418446 [details] [diff] [review]
More range-check tweaks

pushed to TR as changeset:   3425:f184f146c647
Attachment #418446 - Attachment is obsolete: true
Comment on attachment 419085 [details] [diff] [review]
Rearrange MethodEnv::Aux to improve MOPS performance

looks like hoisting wins
Attachment #419085 - Flags: review?(edwsmith) → review-
Target Milestone: flash10.1 → flash10.2
Updated, rebased version of this patch. Note that this patch causes some perf regression (which will be largely mitigated by a subsequent patch)
Attachment #419084 - Attachment is obsolete: true
Attachment #422806 - Flags: review?(edwsmith)
Attachment #422806 - Attachment is obsolete: true
Attachment #422806 - Flags: review?(edwsmith)
Comment on attachment 422806 [details] [diff] [review]
Move MOPS memory handling to DomainEnv, v2

Marking obsolete; the nanojit hoisting approach is cleaner and already in place
Attachment #419085 - Attachment is obsolete: true
Comment on attachment 422806 [details] [diff] [review]
Move MOPS memory handling to DomainEnv, v2

I marked the wrong patch obsolete... de-obsoleting and re-requesting review
Attachment #422806 - Attachment is obsolete: false
Attachment #422806 - Flags: review?(edwsmith)
Attached patch Collapse range checking (obsolete) — Splinter Review
Smarten range checking to avoid redundant checks. This patch relies on Bug 522593 having landed.
Attachment #423016 - Flags: review?(edwsmith)
Edwin... I know this depends on bug 522593, but if you have bandwidth to look over this in advance of that, let me know.
Depends on: 522593
I realized that we don't actually need to use the inlining feature for this; instead, I can use the hoisting code that's already landed. This patch provides equivalent functionality to the previous one and removes the dependency on the inliner.
Attachment #423016 - Attachment is obsolete: true
Attachment #423591 - Flags: review?(edwsmith)
Attachment #423016 - Flags: review?(edwsmith)
No longer depends on: 522593
Edwin gave me verbal approval to push these two patches with "pending" approval, thus, pushed for now as

http://hg.mozilla.org/tamarin-redux/rev/0058a17d93db
http://hg.mozilla.org/tamarin-redux/rev/5e1db2497d2d
Attachment #422806 - Flags: review?(edwsmith)
Attachment #423591 - Flags: review?(edwsmith)
Status: NEW → RESOLVED
Closed: 14 years ago
Resolution: --- → FIXED
Flags: flashplayer-bug-