Closed
Bug 458279
Opened 16 years ago
Closed 14 years ago
MOPS opcodes need to be optimized on CodegenLIR/nanojit
Categories
(Tamarin Graveyard :: Virtual Machine, defect, P4)
Tamarin Graveyard
Virtual Machine
Tracking
(Not tracked)
RESOLVED
FIXED
Q3 11 - Serrano
People
(Reporter: stejohns, Assigned: stejohns)
References
Details
Attachments
(2 files, 23 obsolete files)
27.16 KB,
patch
|
Details | Diff | Splinter Review | |
17.42 KB,
patch
|
Details | Diff | Splinter Review |
The fast memory-access opcodes added in https://bugzilla.mozilla.org/show_bug.cgi?id=458277 aren't yet supported in nanojit mode (they fall back to interpreter-only mode).
Assignee | ||
Comment 1•16 years ago
|
||
Correction; as of the latest patch in http://hg.mozilla.org/users/stejohns_adobe.com/tamarin-mops/ they simply generate method calls; however, the x86-32 specialization done in MIR produces substantial performance gains and should be ported to LIR for all architectures.
Updated•16 years ago
|
Flags: flashplayer-triage+
Flags: flashplayer-qrb?
Updated•15 years ago
|
Summary: MOPS opcodes need to be implemeted for CodegenLIR/nanojit → MOPS opcodes need to be optimized on CodegenLIR/nanojit
Flags: flashplayer-qrb? → flashplayer-qrb+
Priority: -- → P3
Target Milestone: --- → flash10.x
Updated•15 years ago
|
Priority: P3 → P4
Assignee | ||
Comment 2•15 years ago
|
||
Rough initial patch, totally incomplete, but to get the ball rolling... This patch li8, li16, li32, lf64 (but not lf32) using nanojit. It passes acceptance test on MacTel32 and MacTel64, but has a number of serious deficiencies: -- there's no nanojit op to store 8 or 16 bit values, so these are handled by callouts. (see bug 527083) -- there's no nanojit op to coerce between 32-bit and 64-bit floating point values, so that's handled by callout. (see bug 527246) -- it's only implemented for AVMPLUS_UNALIGNED_ACCESS. The opcodes in question require that unaligned loads/stores work correctly, so architectures that don't allow such operations will have to do byte-at-a-time operations. (Probably via callouts.) -- it doesn't handle byte-swapping. The opcode in question require that the values in memory are always stored little-endian, regardless of the cpu's architecture (thus big-endian machines must swap for loads and stores.) -- the MIR version had an optimization that could collapse adjacent range checks in many cases; this one doesn't, it emits a range check for every load/store. -- the MIR version had an optimization that allowed it to emit actual addresses into the jitted code, which were then patched as necessary later on. this one doesn't, requiring extra reloads from domain->globalMemory every time. -- the MIR version was able to combine load-unsigned+sign-extend operations into a single operation on the backend; this one doesn't, as nanojit has no load-signed primitives (see bug 527085) Obviously, then, this patch is offered more as proof-of-concept, so consider the review to be one of whether the general approach seems sound or not. Fixes to most (if not all) of the above issues will be required prior to landing.
Assignee: nobody → stejohns
Attachment #410995 -
Flags: review?(edwsmith)
Assignee | ||
Comment 3•15 years ago
|
||
see also bug 527258 about whether unaligned-access is correct for ARM or not.
Assignee | ||
Comment 4•15 years ago
|
||
Cleaned-up version of Patch #1; still not ready for review, but closer to useful: This patch has li8, li16, li32, lf64 (but not lf32) and si32, sf64 (but not si8, si16, sf32) using nanojit directly; other operations use callout functions. It passes acceptance test on MacTel32 and MacTel64, but has a number of serious deficiencies: -- we use LIR_ldcb and LIR_ldcs for 8/16 bit loads, but this is arguably wrong; we really need volatile loads to match expected behavior (see bug 527085). Acceptance tests pass, but probably more extensive usage would fail in subtle ways. -- there's no nanojit op to store 8 or 16 bit values, so these are handled by callouts. (see bug 527083) -- there's no nanojit op to coerce between 32-bit and 64-bit floating point values, so that's handled by callout. (see bug 527246) -- AVMPLUS_UNALIGNED_ACCESS and AVMPLUS_BIG_ENDIAN systems always use callouts vs inlined code. This may or may not be acceptable for a final version; I suspect that at least some ops (eg li16/si16) are probably worth inlining on such systems. -- the MIR version had an optimization that could collapse adjacent range checks in many cases; this one doesn't, it emits a range check for every load/store. -- the MIR version had an optimization that allowed it to emit actual addresses into the jitted code, which were then patched as necessary later on. this one doesn't, requiring extra reloads from domain->globalMemory every time. -- the MIR version was able to combine load-unsigned+sign-extend operations into a single operation on the backend; this one doesn't, as nanojit has no load-signed primitives (see bug 527085) So patch is still proof-of-concept, but again, worthy of cursory examination...
Attachment #410995 -
Attachment is obsolete: true
Attachment #411001 -
Flags: review?(edwsmith)
Attachment #410995 -
Flags: review?(edwsmith)
Assignee | ||
Updated•15 years ago
|
Comment 5•15 years ago
|
||
Comment on attachment 411001 [details] [diff] [review] Rough patch #2 initial comments (review removed, too early for + or -) * do we really need duplicate range-check-failed functions? the jit code could call MethodEnv::mopRangeCheckFailed(). or, everyone could use mop_rangeCheckFailed from instr.cpp regardless of VMCFG_JIT_MOPS. (instr.cpp isn't meant to be jit specific). * suggest breaking patch up in a couple layers: * moving code around and into instr.cpp/h * new load/store opcodes (pure nanojit changes) * add feature jit-mops
Attachment #411001 -
Flags: review?(edwsmith)
Assignee | ||
Comment 6•15 years ago
|
||
(In reply to comment #5) > (From update of attachment 411001 [details] [diff] [review]) > initial comments (review removed, too early for + or -) > > * do we really need duplicate range-check-failed functions? the jit code could > call MethodEnv::mopRangeCheckFailed(). or, everyone could use > mop_rangeCheckFailed from instr.cpp regardless of VMCFG_JIT_MOPS. (instr.cpp > isn't meant to be jit specific). yeah, the latter is what I'm going in my working code, all the MethodEnv calls are going to go away. > * suggest breaking patch up in a couple layers: > * moving code around and into instr.cpp/h > * new load/store opcodes (pure nanojit changes) > * add feature jit-mops Good call. At this point I want to get something pretty close to landable, then I'll split it into reviewable chunks.
Assignee | ||
Comment 7•15 years ago
|
||
Re-doing this as smaller patches for review... First is drive-by cleanup in Domain: privatize vars, convert to C99 types, and (most importantly) remove all the memory-protection-permission code. The new jit code probably won't emit the constants in code memory (at least initially), and if we do end up needing it, we need to use the code in nanojit anyway.
Attachment #411001 -
Attachment is obsolete: true
Attachment #411732 -
Flags: review?(edwsmith)
Assignee | ||
Comment 8•15 years ago
|
||
Same comments as before, just attached wrong version of patch
Attachment #411732 -
Attachment is obsolete: true
Attachment #411733 -
Flags: review?(edwsmith)
Attachment #411732 -
Flags: review?(edwsmith)
Assignee | ||
Comment 9•15 years ago
|
||
Next patch (additive to previous): -- move all the mops load/store callouts from methodenv into instr.cpp and update LIR and Interpreter to use them -- the new load/store callouts don't do any range checking, they assume the caller does it. -- added code to LIR to emit range-checks prior to the load/store callouts. this code is naive relative to MIR but will be improved in subsequent patches, to collapse redundant checks. (in the short-term this means that LIR code generation for the MOPS code will be fatter, as the range checks used to be inside the call itself.) -- we allocate a bit of codemgr memory for each method that uses MOPS opcodes to stash a copy of the memory base and size. this requires an extra load (relative to MIR, which inlined the address/size into the instr stream itself) but greatly simplifies the code; I want to actually measure the perf before trying that approach, as it will be tricky to do properly in LIR
Attachment #411746 -
Flags: review?(edwsmith)
Assignee | ||
Comment 10•15 years ago
|
||
only 32-bit int and 64-bit doubles are active now, since nj lacks the other primitives we need. -- I was unsure if I should have AVMFEATURE_JIT_MOPS require UNALIGNED and LITTLE_ENDIAN at the feature level, or implementation level. currently opted for the latter, mainly because UNALIGNED isn't really part of the feature system, so it would make configuring a little awkward. opinions? -- the idea is that when nj implements all the primitives, this will morph to either-or at compile time; this is a temporary compromise.
Attachment #411757 -
Flags: review?(edwsmith)
Assignee | ||
Comment 11•15 years ago
|
||
Optimization: when doing a load/store that is a constant offset from a previous load/store, emit an instruction with the right displacement rather than doing math at runtime. Also modified th mops.abs testcase to exercise this. (It's a common idiom in Alchemy-generated code.)
Attachment #411772 -
Flags: review?(edwsmith)
Updated•15 years ago
|
Attachment #411733 -
Flags: review?(edwsmith) → review+
Comment 12•15 years ago
|
||
CodegenLIR.cpp:4901 are we sure this comment is correct? since size is uint32_t, it could be >2GB, and a negative int could fall in range. but why would "addr" (offset from globalMemoryBase) be signed in the first place; if everything is unsigned then there's no need for the trickery or at least no need for the elaborate explanation? if we support >2GB memory blocks (on a 64bit cpu, say) then this code needs to be conditional and when size is large, we need to do the whole thing with 64bit math (then the 64bit negative mopAddr value will wrap around and still be larger than the large 32 bit size constant). line 4912: if mopsMemoryBase is not pointing to a GC object, then we can use LIR_piadd here instead of addp. LIR_piadd is a pure add operator whereas LIR_addp is restricted from certian optimizations that can leave dangling interior pointers. line 4888: this would all be somewhat cleaner if the mops patching code in Domain were moved into CodegenLIR, maybe with the patching lists living on CodeMgr goes away and so do the necessary patching lists. and this way when the jit changes and we change how patching is done, it ought to not cause churn in class Domain.
Comment 13•15 years ago
|
||
Comment on attachment 411746 [details] [diff] [review] Refactor mops callouts R+ as is (no bugs found); if the recommend cleanups are implemented, only re-review if you feel it changed enough to require a second look.
Attachment #411746 -
Flags: review?(edwsmith) → review+
Comment 14•15 years ago
|
||
(In reply to comment #12) > if we support >2GB memory blocks (on a 64bit cpu, say) The largest memory block supported by MMgc is 2^32-1 bytes, in practice this means 2^32-8 bytes, and that's including headers and so on, so the maximum user payload is 2^32-4096, probably. This is true on 64-bit systems as well.
Comment 15•15 years ago
|
||
Comment on attachment 411757 [details] [diff] [review] Use exiting nanojit primitives instead of callouts I'm not convinced the extra fields in MopsInfo and the branchy ifdef-y logic when emitting loads and stores is worth the trouble. why not split out the cases that we can support with inline loads/stores (right now, int32 and double), and factor so that shared code is in a helper subroutine?
Attachment #411757 -
Flags: review?(edwsmith) → review-
Comment 16•15 years ago
|
||
Comment on attachment 411772 [details] [diff] [review] Optimize loads with constant offsets whitespace is getting messed up
Assignee | ||
Comment 17•15 years ago
|
||
(In reply to comment #12) > CodegenLIR.cpp:4901 are we sure this comment is correct? since size is > uint32_t, it could be >2GB, and a negative int could fall in range. but why > would "addr" (offset from globalMemoryBase) be signed in the first place; if > everything is unsigned then there's no need for the trickery or at least no > need for the elaborate explanation? I talked about this with Scott Peterson offline... in the existing MIR version, the offset is verified as INT_TYPE rather than UINT_TYPE, and he was of the opinion that offsets >2GB probably wouldn't work. So I'm going to review the patch to use int32 uniformly for offset, and check for negative values. (Yes, this sucks, but we're likely stuck with it now.) > if we support >2GB memory blocks (on a 64bit cpu, say) then this code needs to > be conditional and when size is large, we need to do the whole thing with 64bit > math (then the 64bit negative mopAddr value will wrap around and still be > larger than the large 32 bit size constant). MMGC doesn't support blocks larger than 4GB even on 64bit, but the signedness of offset moots even this. > line 4912: if mopsMemoryBase is not pointing to a GC object, then we can use > LIR_piadd here instead of addp. LIR_piadd is a pure add operator whereas > LIR_addp is restricted from certian optimizations that can leave dangling > interior pointers. mopsMemoryBase is pointing to either: -- storage that lives inside ByteArray, which is allocated via mmfx_new -- Domain::globalMemoryScratch, which is embedded inside class Domain we could easily change globalMemoryScratch to also be mmfx_new allocated, then change this op... > line 4888: this would all be somewhat cleaner if the mops patching code in > Domain were moved into CodegenLIR, maybe with the patching lists living on > CodeMgr goes away and so do the necessary patching lists. and this way when > the jit changes and we change how patching is done, it ought to not cause churn > in class Domain. Yeah. Probably so. Let me look at doing that.
Assignee | ||
Comment 18•15 years ago
|
||
(In reply to comment #15) > (From update of attachment 411757 [details] [diff] [review]) > I'm not convinced the extra fields in MopsInfo and the branchy ifdef-y logic > when emitting loads and stores is worth the trouble. yeah, it's ugly. I've already re-thought it in a private branch, I'll clean up and resubmit.
Assignee | ||
Comment 19•15 years ago
|
||
Comment on attachment 411733 [details] [diff] [review] Correct Domain patch pushed to redux as changeset: 3087:68b17ac5150f
Attachment #411733 -
Attachment is obsolete: true
Assignee | ||
Comment 20•15 years ago
|
||
Comment on attachment 411746 [details] [diff] [review] Refactor mops callouts pushed to redux as changeset: 3088:ffcc8860bac9
Attachment #411746 -
Attachment is obsolete: true
Assignee | ||
Comment 21•15 years ago
|
||
(In reply to comment #12) > line 4888: this would all be somewhat cleaner if the mops patching code in > Domain were moved into CodegenLIR, maybe with the patching lists living on > CodeMgr goes away and so do the necessary patching lists. and this way when > the jit changes and we change how patching is done, it ought to not cause churn > in class Domain. So, I looked into doing this, and it's an ugly rathole... Domain and ByteArray are tightly coupled in an unpleasant way... and Flash/AIR are using a slightly different variant of ByteArray (vs the one in tamarin/shell). I'm going to take some baby steps to decouple them a little bit, but we may not want to move this into CodeMgr/JIT/etc just yet until we untangle it enough to fully understand the implications...
Assignee | ||
Comment 22•15 years ago
|
||
More Domain cleanup to try to decouple ByteArray from Domain, at least a little bit, along with drive-by cleanup of privatizing everything. The fundamental issue is that Domain needs to work with two different "ByteArray" objects, with completely different AS3 and C++ inheritances... we currently have everything just right so that they both work, but that's not sustainable. Long-term, we really need to unify them (in Tamarin) but that's far too ticklish for the short-term schedule; thus the introduction of ScriptObject::getGlobalMemoryProvider as the communications channel. Yes, it's evil, but we have precedent (and it's also far more robust than the old isMemoryObject method was). Also changed the scratch memory to use mmfx_new memory to pave the way for the jit addition optimization mentioned in an earlier comment. Lot of churn here (sorry bout that) but it's long overdue IMHO.
Attachment #411757 -
Attachment is obsolete: true
Attachment #411772 -
Attachment is obsolete: true
Attachment #412116 -
Flags: review?(edwsmith)
Attachment #411772 -
Flags: review?(edwsmith)
Updated•15 years ago
|
Attachment #412116 -
Flags: review?(edwsmith) → review+
Assignee | ||
Comment 23•15 years ago
|
||
Comment on attachment 412116 [details] [diff] [review] More Domain cleanup, ByteArray decoupling pushed as changeset: 3186:9882a0bbd3fd
Attachment #412116 -
Attachment is obsolete: true
Assignee | ||
Comment 24•15 years ago
|
||
nanojit only defines this for i386 currently. MOPS requires little-endian unaligned access, so only platforms that can support this will ever be supported directly, regardless of nanojit support (so x64 and later ARM chips will eventually be enabled, PPC/Sparc probably not)
Attachment #415929 -
Flags: review?(edwsmith)
Comment 25•15 years ago
|
||
Comment on attachment 415929 [details] [diff] [review] Have LIR use NJ_EXPANDED_LOADSTORE_SUPPORTED ops when appropriate > (so x64 and later ARM chips will eventually be enabled, PPC/Sparc probably not) support for unaligned access is the hard requirement. we can support big endian cpu's if they can do unaligned memory access and if they have a cheap byte-swap instruction. (ppc has one). * NJ_EXPANDED_LOADSTORE_SUPPORTED is not exactly the right flag to use for loads we support on all platforms (32bit ints, 64bit floats), and at the same time, we need something that ensures we only use LIR ops for MOPS ops when the CPU has the proper semantics. (little-endian, support-for-unaligned access) If it is feasible to factor the range-check optimizations into a LirWriter filter then that would be preferrable. Plus, comments explaining the transformations being performed in mopAddrToRangeCheckedRealAddrAndDisp() would help. I am having a hard time following the code. (e.g. before/after examples, notes on the context for the optimziation (basic blocks only?) ) (still studying the code looking for bugs, but that's enough feedback for now I hope).
Attachment #415929 -
Flags: review?(edwsmith) → review-
Comment 26•15 years ago
|
||
(In reply to comment #25) > If it is feasible to factor the range-check optimizations into a LirWriter > filter then that would be preferrable. Nevermind about LirWriter, I'm looking at the applied patch in context now and its not bad as-is. I can still imagine a filter that generalizes the optimization we're doing but its probably not worth making a filter just for what we have (one callout in 4 spots, is not a big deal).
Comment 27•15 years ago
|
||
The sign-extend folding could cause loads to move relative to stores: if (opcode == OP_sxi8 && val->opcode() == LIR_ldzb) { // optimize sxi8(li8) -> LIR_ldbx val->initLInsLd(LIR_ldsb, val->oprnd1(), val->disp()); break; } here's an ABC sketch. assume address "A" has an interesting value "x": li8 A ( x ) pushint 0 ( x 0 ) si8 A ( x ) // now memory[A] == 0 sxi8 ( x' ) the optimization will emit this LIR without the optimization: ldzb1 = ldzb A stb 0, A // now memory[A] == 0 result = sxi8 ldzb1 // do the sign extend of the loaded value here is LIR with the optimization: ldzb1 = ldzb A // dead code stb 0, A // memory[A] == 0 result = ldsb A // loads 0 The fix is something that ensures there are no intervening stores. the simplest fix is to require the OP_sxi8 be immediately after the load in question. you can't detect that just by looking at the operand, something needs to look at the sequence of instructions, peephole-style.
Assignee | ||
Comment 28•15 years ago
|
||
(In reply to comment #25) > support for unaligned access is the hard requirement. we can support big > endian cpu's if they can do unaligned memory access and if they have a cheap > byte-swap instruction. (ppc has one). yeah, but it doesn't support unaligned access, does it? (not sure) > * NJ_EXPANDED_LOADSTORE_SUPPORTED is not exactly the right flag to use for > loads we support on all platforms (32bit ints, 64bit floats), and at the same > time, we need something that ensures we only use LIR ops for MOPS ops when the > CPU has the proper semantics. (little-endian, support-for-unaligned access) well, yeah, that's why it's combined with checks for LITTLE_ENDIAND and UNALIGNED_ACCESS > Plus, comments explaining the > transformations being performed in > mopAddrToRangeCheckedRealAddrAndDisp() would > help. I am having a hard time following the code. (e.g. before/after > examples, notes on the context for the optimziation (basic blocks only?) ) I'll see what I can do. (In reply to comment #27) > The sign-extend folding could cause loads to move relative to stores: Nice catch. I'll resubmit.
Assignee | ||
Comment 29•15 years ago
|
||
(In reply to comment #27) > The fix is something that ensures there are no intervening stores. the > simplest fix is to require the OP_sxi8 be immediately after the load in > question. you can't detect that just by looking at the operand, something > needs to look at the sequence of instructions, peephole-style. Gah. peephole-style optimizations are perversely painful to do in nanojit... there's no generic "prevIns" call (we'd need to link LirBuffer in both directions to do this generally), and both LirWriter and LirReader are only well-suited to single-insn-at-a-time work. I can probably find a way to work around it in this case but this isn't the only case I've seen... we should think if there's a better way of doing instruction-sequence optimizations.
Comment 30•15 years ago
|
||
(In reply to comment #28) > (In reply to comment #25) > > support for unaligned access is the hard requirement. we can support big > > endian cpu's if they can do unaligned memory access and if they have a cheap > > byte-swap instruction. (ppc has one). > > yeah, but it doesn't support unaligned access, does it? (not sure) Steven might be thinking of the load word with reversed bytes instruction (lwbrx), I assume too it needs to be aligned but I haven't checked. Warren [1] notes that you can implement in-register byte reversal on PPC with three instructions (rotate followed by two rotate left with mask insert, details absent). Sparc does not have a byte-swap instruction from what I can tell (only checked the v9 instruction set). The best I can think of is (x >> 24) | (x << 24) | ((x & 0x00FF0000) >> 8) | ((x << 8) & 0x00FF0000) which is 10 instructions and two temp registers on SPARC (and untested!) On MIPS one should be able to use 0x0000FF00 analogously and avoid one temp register and the SETHI instruction (since immediates are 16 bits IIRC). [1] Henry S Warren, "Hacker's Delight", Addison Wesley 2003.
Comment 31•15 years ago
|
||
(In reply to comment #29) > (In reply to comment #27) > > The fix is something that ensures there are no intervening stores. the > > simplest fix is to require the OP_sxi8 be immediately after the load in > > question. you can't detect that just by looking at the operand, something > > needs to look at the sequence of instructions, peephole-style. > > Gah. peephole-style optimizations are perversely painful to do in nanojit... > there's no generic "prevIns" call (we'd need to link LirBuffer in both > directions to do this generally), and both LirWriter and LirReader are only > well-suited to single-insn-at-a-time work. I can probably find a way to work > around it in this case but this isn't the only case I've seen... we should > think if there's a better way of doing instruction-sequence optimizations. It might be a lot easier to do the peephole analysis on the ABC instruction sequence. The CodeWriter interface allows pipelined operations, it might be possible to introduce a peephole analyzer. Since WordcodeEmitter includes a peephole optimizer, you might get ideas from that, although it is much more general.
Assignee | ||
Comment 32•15 years ago
|
||
(In reply to comment #31) > It might be a lot easier to do the peephole analysis on the ABC instruction > sequence. The CodeWriter interface allows pipelined operations, it might be > possible to introduce a peephole analyzer. Since WordcodeEmitter includes a > peephole optimizer, you might get ideas from that, although it is much more > general. Yeah, that definitely will be a simpler way to solve at least one of my problems (li8+sxi8 fusing), but I don't think it is amenable to some others (collapsing redundant range checks)... but let me ponder if there's a clever way to deal with that. In any event my complaint still stands: peephole opt at the LIR level would be useful but currently is painful.
Assignee | ||
Comment 33•15 years ago
|
||
Revised version of patch: -- expand commenting in mopAddrToRangeCheckedRealAddrAndDisp(), and remove incorrect optimization we had before (for const-expr) -- remove entirely the optimizations for sxi8 and sxi16 (these will be restored in a subsequent patch, probably at the ABC level)
Attachment #415929 -
Attachment is obsolete: true
Attachment #416435 -
Flags: review?(edwsmith)
Assignee | ||
Comment 34•15 years ago
|
||
(additive to patch 416435) specialize li8+sxi8 and li16+sxi16 pairs in LIR and wordcode. This borrows the approach used by findpropglobal and findpropglobalstrict (ie, uses empty space in the ABC opcode range).
Attachment #416456 -
Flags: review?(edwsmith)
Assignee | ||
Comment 35•15 years ago
|
||
Per earlier comment: since mops memory is not a GCObject, we can use piadd instead of addp.
Attachment #416483 -
Flags: review?(edwsmith)
Assignee | ||
Comment 36•15 years ago
|
||
A little algebra and signed/unsigned trickery allows us to reduce all range checks to a single compare. This makes a dramatic speed difference, even without trying to collapse redundant checks within a single basic block.
Attachment #416495 -
Flags: review?(edwsmith)
Assignee | ||
Comment 37•15 years ago
|
||
May seem counterintuitive, but the mopAddr can legitimately be negative, as it can be part of an earlier calculation (with a positive constant displacement) due to the optimization earlier in the function.
Attachment #416812 -
Flags: review?(edwsmith)
Updated•15 years ago
|
Attachment #416483 -
Flags: review?(edwsmith) → review+
Updated•15 years ago
|
Attachment #416812 -
Flags: review?(edwsmith) → review+
Comment 38•15 years ago
|
||
Comment on attachment 416435 [details] [diff] [review] Have LIR use NJ_EXPANDED_LOADSTORE_SUPPORTED ops when appropriate, v2 * needs an assert (and explanation) somewhere prominent that the mops memory size is <2GB * I had to refactor and comment the code using sumFitsInInt32 a bunch before i grokked what was happening. the result is only slightly clearer, a comment at the end of each branch would help.... meh. * ditto, in the code emitting the range check, a comment that shows the emitted code in C-like-syntax is extremely helpful. e.g., for the block at CodegenLIR::4986: // if (uint32(mopAddr) > uint32(mopsMemorySize - size)) goto rangeCheckFail_label .. that said, we dont need another round of review, just as much clarity as you can stomach, in a few key places. what makes review hard is getting the context in your head for what the big picture is for these mops. maybe a separate doc would help with that. not critical-path for these patches.
Attachment #416435 -
Flags: review?(edwsmith) → review+
Comment 39•15 years ago
|
||
* stray commented-out code in AvmCore.cpp? * instr.h: needs comments to clarify that li8/16 are zero extended (or rename to liz8/16 for symmetry?) * Verifier.cpp:1792 the peephole logic can hiccup if there is a branch target without a label inbetween OP_li* and OP_sx*, which is possible in the context of forwards jumps: li8 &x ( -- x ) ... ( -- x bool ) N: iftrue N+2 ( -- x ) N+1: pop ( -- ) N+1: li8 &y ( -- y ) N+2: sxi8 ( -- sx(y) | sx(x) } you can test for the presence of an implicit label using (blockStates ? blockStates->get(pc+1) != NULL : false) (explicit labels are no problem, the peepholer will see the label instruction). * need to review same idea, but what if a try/catch range ends right inbetween the two instructions. i think this is harmless since it doesn't imply a jump, but it would be smart to sketch an example and argue it's safe. (or defer the analysis and only do the optimization for no-try-catch methods)... heck, it could a nearly-nonexistant pattern in Alchemy output. * nice to have, if feasable: encapsulate the peephole logic in a separate CodeWriter.
Updated•15 years ago
|
Attachment #416456 -
Flags: review?(edwsmith) → review-
Comment 40•15 years ago
|
||
> // we want to fail range-check if
> // (mopAddr + disp < 0 || mopAddr + disp + size > mopsMemorySize)
^^^
I think the test should be >=
(mopAddr + disp < 0 || mopAddr + disp + size >= mopsMemorySize)
then, let A=mopAddr, D=disp, S=size, M=mopsMemorySize:
A + D < 0 || A + D + S >= M
A < -D || A >= M - D - S
!(A >= -D && A < M - D - S)
apply rule: (x >= min && x < max) ==> (unsigned(x-min) < unsigned(max-min))
!(unsigned(A - -D) < unsigned(M-D-S - -D))
!(unsigned(A + D) < unsigned(M-D-S + D))
!(unsigned(A + D) < unsigned(M - S))
unsigned(A + D) >= unsigned(M - S)
if you buy all this, then in the code, CodegenMIR::5029: should s/LIR_ugt/LIR_uge.
lastly, D and S (i.e. disp and size) are constant. is this valid or does int wraparound invalidate it:
unsigned(A + D) >= unsigned(M - S)
unsigned(A + D - D) >= unsigned(M - S - D)
unsigned(A) >= unsigned(M - const(S+D))
Updated•15 years ago
|
Attachment #416495 -
Flags: review?(edwsmith) → review-
Comment 41•15 years ago
|
||
(In reply to comment #40) > > // we want to fail range-check if > > // (mopAddr + disp < 0 || mopAddr + disp + size > mopsMemorySize) > ^^^ > I think the test should be >= sigh, no it shouldn't. code is correct as-is. the final question above about reassociating the constants still applies, though.
Updated•15 years ago
|
Attachment #416495 -
Flags: review- → review+
Assignee | ||
Comment 42•15 years ago
|
||
(In reply to comment #40) > lastly, D and S (i.e. disp and size) are constant. is this valid or does int > wraparound invalidate it: > > unsigned(A + D) >= unsigned(M - S) > unsigned(A + D - D) >= unsigned(M - S - D) > unsigned(A) >= unsigned(M - const(S+D)) IIRC wraparound invalidates it; I tried that and had failures in obscure cases (I can dig 'em up if you are curious)
Comment 43•15 years ago
|
||
no need, it's wrong by inspection if A==0
Assignee | ||
Comment 44•15 years ago
|
||
(In reply to comment #39) > * stray commented-out code in AvmCore.cpp? oops > * instr.h: needs comments to clarify that li8/16 are zero extended (or rename > to liz8/16 for symmetry?) cool > * Verifier.cpp:1792 the peephole logic can hiccup if there is a branch target possible, yes, but as a pragmatic matter, Alchemy seems to always emit these two instructions in lockstep. I'll see if I can fix it, but testing will be problematic > * need to review same idea, but what if a try/catch range ends right inbetween > the two instructions. see above. (Note that missing these is harmless, just less efficient since we load and then explicitly sign extend) > * nice to have, if feasable: encapsulate the peephole logic in a separate > CodeWriter. Feasible but IMHO overkill; I looked at doing so but for this simple case it's far less code to take the easy way out. If we start adding other peephole optimizations then yeah, but for this one I think simple is good.
Assignee | ||
Comment 45•15 years ago
|
||
With (some of) Edwin's suggestions done
Attachment #416456 -
Attachment is obsolete: true
Attachment #416973 -
Flags: review?(edwsmith)
Assignee | ||
Comment 46•15 years ago
|
||
(In reply to comment #41) > sigh, no it shouldn't. code is correct as-is. the final question above about > reassociating the constants still applies, though. sadly, some exhaustive testing reveals flaws; consider mopAddr=-2147483648 disp=-8 size=8 mopsMemorySize=8 which yields false for the old test but true for the optimized test.
Assignee | ||
Comment 47•15 years ago
|
||
(In reply to comment #46) It seems there are unlikely-but-possible edge cases the range checking doesn't handle... I'm going to back off to a less aggressive version for now, and research possible improvements for future patches
Assignee | ||
Comment 48•15 years ago
|
||
Comment on attachment 416495 [details] [diff] [review] Smarten the range-check Not convinced the optimization works for all numeric combinations. Will investigate and resubmit.
Attachment #416495 -
Attachment is obsolete: true
Assignee | ||
Comment 49•15 years ago
|
||
Comment on attachment 416435 [details] [diff] [review] Have LIR use NJ_EXPANDED_LOADSTORE_SUPPORTED ops when appropriate, v2 pushed (with minor mods) as part of changeset: 3295:b1842cff6d9f
Attachment #416435 -
Attachment is obsolete: true
Assignee | ||
Comment 50•15 years ago
|
||
Comment on attachment 416483 [details] [diff] [review] Use piadd instead of addp pushed as part of changeset: 3295:b1842cff6d9f
Attachment #416483 -
Attachment is obsolete: true
Assignee | ||
Comment 51•15 years ago
|
||
Comment on attachment 416812 [details] [diff] [review] Address calc should use i2p, not u2p pushed as part of changeset: 3295:b1842cff6d9f
Attachment #416812 -
Attachment is obsolete: true
Assignee | ||
Comment 52•15 years ago
|
||
(In reply to comment #46) > sadly, some exhaustive testing reveals flaws flawed algebra, alas: my patch asserts that (mopAddr + disp < 0) == (mopAddr < -disp) but the LHS above should really be written as (mopAddr+disp)%MAXINT < 0 (to handle the overflow/underflow cases), which isn't equivalent to the RHS.
Comment 53•15 years ago
|
||
(In reply to comment #51) > (From update of attachment 416812 [details] [diff] [review]) > pushed as part of changeset: 3295:b1842cff6d9f This patch is failing to compile on windows64, and has currently closed the tamarin-redux tree: core\CodegenLIR.cpp(2920) : error C2220: warning treated as error - no 'object' file generated core\CodegenLIR.cpp(2920) : warning C4267: 'argument' : conversion from 'size_t' to 'int32_t', possible loss of data core\CodegenLIR.cpp(2930) : warning C4267: 'argument' : conversion from 'size_t' to 'int32_t', possible loss of data core\CodegenLIR.cpp(2938) : warning C4267: 'argument' : conversion from 'size_t' to 'int32_t', possible loss of data
Comment 54•15 years ago
|
||
(In reply to comment #53) Disregard comment #53, I now see that Steven reverted the changes in 3295
Comment 55•15 years ago
|
||
(In reply to comment #39) > * Verifier.cpp:1792 the peephole logic can hiccup if there is a branch target > without a label inbetween OP_li* and OP_sx*, which is possible in the context > of forwards jumps: > li8 &x ( -- x ) > ... ( -- x bool ) > N: iftrue N+2 ( -- x ) > N+1: pop ( -- ) > N+1: li8 &y ( -- y ) > N+2: sxi8 ( -- sx(y) | sx(x) } I could have been clearer about how this example would be broken: here's the output after peephole optimization: li8 &x ( -- x ) ... ( -- x bool ) N: iftrue N+2 ( -- x ) N+1: pop ( -- ) N+1: lix8 &y ( -- y ) /* also advances verify-time pc to N+3 */ N+2: ??? ( -- sx(y) | sx(x) } N+3: ... In each bytecode, the verifier checks to see if PC == an implicit label (see the if block after "bool unreachable = false".). Advancing pc past this point means the label won't be processed at all, which surely can cause problems. At the very least, the sxi8 instruction won't be emitted at all, since we skipped it, which means it will not execute on the path when branch at N:iftrue is taken. possible fix: do not skip the sxi8. in the JIT, if the input to sxi8 is lxi8, then the sign-extend is a no-op and the load can be directly referenced. in the test example, the JIT will have processed a new basic block starting at N+2, and will not ignore the sxi8 (since there are two control flow paths leading to it, thus two possible input values). but in typical code, the lix8 dominates the sxi8, the input to sxi8 is known to be the load, and the jit can drop the redundant sxi8.
Updated•15 years ago
|
Attachment #416973 -
Flags: review?(edwsmith) → review-
Assignee | ||
Comment 56•15 years ago
|
||
> possible fix: do not skip the sxi8
nice thought. I'll cook up an example and verify that what we expect is
happening.
Assignee | ||
Comment 57•15 years ago
|
||
Revised to handle the implicit-label issue Ed pointed out. Also added testcase.
Attachment #416973 -
Attachment is obsolete: true
Attachment #417167 -
Flags: review?(edwsmith)
Updated•15 years ago
|
Attachment #417167 -
Flags: review?(edwsmith) → review+
Assignee | ||
Comment 58•15 years ago
|
||
Further tweaks: -- loading globalMemoryBase/Size via LIR_ldc is not right, since we need to reload if any user call could have changed the domain's memory. Added a new Filter to maintain this and reload as necessary. -- track the range of active checks, so any that are provably redundant can be eliminated.
Attachment #417803 -
Flags: review?(edwsmith)
Assignee | ||
Comment 59•15 years ago
|
||
Comment on attachment 417167 [details] [diff] [review] Specialize load-and-sign-extend, take 3 pushed as 3316 19da674be2ef
Attachment #417167 -
Attachment is obsolete: true
Comment 60•15 years ago
|
||
two nits: It's not necessary to override LirWriter methods that just call out->whatever, since that's what the base methods do in LirWriter. the code that uses and updates mopsRangeCheckFilter.activeMin/Max would be cleaner if it were encapsulated in the filter and just called from CodegenLIR... in the long run its much easier to think about filter behaviors when their internal state is well encapsulated. but... can't find anything wrong with the logic, so as long as tests pass (etc), push at will.
Updated•15 years ago
|
Attachment #417803 -
Flags: review?(edwsmith) → review+
Assignee | ||
Comment 61•15 years ago
|
||
I'll do some cleanup work before pushing.
Assignee | ||
Comment 62•15 years ago
|
||
Patch with Edwin's suggestions... since buildbot is down today I don't want to push it yet, so I might as well get a re-review :-)
Attachment #417803 -
Attachment is obsolete: true
Attachment #418226 -
Flags: review?(edwsmith)
Assignee | ||
Comment 63•15 years ago
|
||
Let's try that again... wrong patch uploaded, with incorrect range-checking code.
Attachment #418226 -
Attachment is obsolete: true
Attachment #418233 -
Flags: review?(edwsmith)
Attachment #418226 -
Flags: review?(edwsmith)
Updated•15 years ago
|
Attachment #418233 -
Flags: review?(edwsmith) → review+
Comment 64•15 years ago
|
||
Comment on attachment 418233 [details] [diff] [review] 418226: Further range-check improvements, v3 technically we could break content in the future by either raising the minimim mops memory size or lowering it. i'd guess its more likely to break by raising it. should we just set it to 1K and forget? (no need to re-review for that tho)
Assignee | ||
Comment 65•15 years ago
|
||
Take advantage of the fact that (a > b) == (a >= b-1) for ints (if b>0 && b < ffffffff) to avoid a subtraction
Attachment #418446 -
Flags: review?(edwsmith)
Assignee | ||
Comment 66•15 years ago
|
||
Comment on attachment 418233 [details] [diff] [review] 418226: Further range-check improvements, v3 http://hg.mozilla.org/tamarin-redux/rev/eb98f48cef15
Attachment #418233 -
Attachment is obsolete: true
Comment 67•15 years ago
|
||
(In reply to comment #24) > (PPC/Sparc probably not) this test program compiles and runs fine on a G5 10.5 machine: #include <stdio.h> #include <stdint.h> int main(int argc, char **argv) { char data[100] = { 1, 2, 3, 4, 5, 6, 7, 8, 9 }; printf("data %p\n", data); int* intp = (int*)(data+1); short* shortp = (short*)(data+1); int64_t* int64p = (int64_t*)(data+1); double* doublep = (double*)(data+1); printf("unaligned int %d\n", *intp); printf("unaligned short %d\n", *shortp); printf("unaligned int64 %d\n", *int64p); printf("unaligned double %f\n", *doublep); return 0; } not sure if this goes back to G4 or G3 machines, but it maybe we only need to do byte-swapping to support inlined mops on ppc. lwbrx does it with one instruction. if there are equvalents for other loads (float, short), then we may pass GO with no added instructions either.
Assignee | ||
Comment 68•15 years ago
|
||
The MOPS globalmemory handling is currently rooted in Domain, but really ought to be rooted in DomainEnv, per offline discussions. This corrects that. Note that this change reduces MOPS performance.
Attachment #419084 -
Flags: review?(edwsmith)
Assignee | ||
Comment 69•15 years ago
|
||
This restructures internal fields in MethodEnv to allow for faster access to the now-DomainEnv-based globalmemory. With both these patches in place, we are back to ~95% or so of the performance we had before (but still not quite at the average performance level of the MIR-based MOPS code in FP10).
Attachment #419085 -
Flags: review?(edwsmith)
Assignee | ||
Comment 70•15 years ago
|
||
Comment on attachment 419085 [details] [diff] [review] Rearrange MethodEnv::Aux to improve MOPS performance This patch works, but is a little conceptually icky (it's definitely infiltrating concerns pretty far down). I think if we could do the op-hoisting as described by Edwin in https://bugzilla.mozilla.org/show_bug.cgi?id=533546#c9 we could avoid the need for this entirely... we could then ensure that DomainEnv is loaded once at the start of a MOPS-using method, and get essentially the same benefit with less cruft.
Updated•15 years ago
|
Attachment #418446 -
Flags: review?(edwsmith) → review+
Updated•15 years ago
|
Attachment #419084 -
Flags: review?(edwsmith) → review+
Assignee | ||
Comment 71•15 years ago
|
||
Comment on attachment 419084 [details] [diff] [review] Move MOPS memory handling to DomainEnv I'm going to hold off pushing this one pending further investigation -- we think there might be a better approach than the subsequent patch to make up the lost performance.
Assignee | ||
Comment 72•15 years ago
|
||
Comment on attachment 418446 [details] [diff] [review] More range-check tweaks pushed to TR as changeset: 3425:f184f146c647
Attachment #418446 -
Attachment is obsolete: true
Comment 73•15 years ago
|
||
Comment on attachment 419085 [details] [diff] [review] Rearrange MethodEnv::Aux to improve MOPS performance looks like hoisting wins
Attachment #419085 -
Flags: review?(edwsmith) → review-
Assignee | ||
Comment 74•14 years ago
|
||
Updated, rebased version of this patch. Note that this patch causes some perf regression (which will be largely mitigated by a subsequent patch)
Attachment #419084 -
Attachment is obsolete: true
Attachment #422806 -
Flags: review?(edwsmith)
Assignee | ||
Updated•14 years ago
|
Attachment #422806 -
Attachment is obsolete: true
Attachment #422806 -
Flags: review?(edwsmith)
Assignee | ||
Comment 75•14 years ago
|
||
Comment on attachment 422806 [details] [diff] [review] Move MOPS memory handling to DomainEnv, v2 Marking obsolete; the nanojit hoisting approach is cleaner and already in place
Assignee | ||
Updated•14 years ago
|
Attachment #419085 -
Attachment is obsolete: true
Assignee | ||
Comment 76•14 years ago
|
||
Comment on attachment 422806 [details] [diff] [review] Move MOPS memory handling to DomainEnv, v2 I marked the wrong patch obsolete... de-obsoleting and re-requesting review
Attachment #422806 -
Attachment is obsolete: false
Attachment #422806 -
Flags: review?(edwsmith)
Assignee | ||
Comment 77•14 years ago
|
||
Smarten range checking to avoid redundant checks. This patch relies on Bug 522593 having landed.
Attachment #423016 -
Flags: review?(edwsmith)
Assignee | ||
Comment 78•14 years ago
|
||
Edwin... I know this depends on bug 522593, but if you have bandwidth to look over this in advance of that, let me know.
Assignee | ||
Comment 79•14 years ago
|
||
I realized that we don't actually need to use the inlining feature for this; instead, I can use the hoisting code that's already landed. This patch provides equivalent functionality to the previous one and removes the depedency on the inliner.
Attachment #423016 -
Attachment is obsolete: true
Attachment #423591 -
Flags: review?(edwsmith)
Attachment #423016 -
Flags: review?(edwsmith)
Assignee | ||
Comment 80•14 years ago
|
||
Edwin gave me verbal approval to push these two patches with "pending" approval, thus, pushed for now as http://hg.mozilla.org/tamarin-redux/rev/0058a17d93db http://hg.mozilla.org/tamarin-redux/rev/5e1db2497d2d
Assignee | ||
Updated•14 years ago
|
Attachment #422806 -
Flags: review?(edwsmith)
Assignee | ||
Updated•14 years ago
|
Attachment #423591 -
Flags: review?(edwsmith)
Assignee | ||
Updated•14 years ago
|
Status: NEW → RESOLVED
Closed: 14 years ago
Resolution: --- → FIXED
Updated•13 years ago
|
Flags: flashplayer-bug-
You need to log in
before you can comment on or make changes to this bug.
Description
•