Closed Bug 584935 Opened 9 years ago Closed 9 years ago

Speed up register iteration loops by using BSF instruction

Categories

(Core Graveyard :: Nanojit, defect, P3, minor)

x86
Windows 7

Tracking

(Not tracked)

RESOLVED FIXED
flash10.2

People

(Reporter: edwsmith, Assigned: edwsmith)

References

Details

(Whiteboard: PACMAN, fixed-in-nanojit)

Attachments

(1 file, 1 obsolete file)

Attached patch proof of concept on x64 (obsolete) — Splinter Review
On x86 or x64, we can use BSF to skip regions of 0s in a bitmask, to quickly iterate through register masks that satisfy various criteria.  For example:

int active = ~free & managed
for (int r=bsf(active); active != 0; i = bsf(active &= ~(1<<r)))
  /* do stuff with _allocator.active[r] */

The same principle ought to work in bug 584608 for quickly finding free AR entries, although I haven't tried that yet.

Preliminary benchmarking on tamarin-redux's jit-only mode (dont run code, just feed a large input set through the JIT) shows about a 3-5% speedup in TR's verify+jit time.  That time would be diluted in the real world, of course.

The attached proof of concept also includes the bitmask-based AR from bug 584608.
Attachment #463412 - Attachment is patch: true
Attachment #463412 - Attachment mime type: application/octet-stream → text/plain
Attachment #463412 - Flags: feedback?(nnethercote)
Comment on attachment 463412 [details] [diff] [review]
proof of concept on x64

I definitely like the basic idea of the patch, hence the f+.  But I have a
few things I'd like changed.

First, I'd like to see nextreg()/prevreg() completely gone, I'd like only
one way to iterate over a loop.  Remaining uses:

- RegAlloc::countActive()
- Assembler::registerConsistencyCheck()
- Assembler::printRegState()
- Assembler::intersectRegisterState() -- in a comment;  just delete the
  whole comment line beginning with "viz."


>+        int active_set = _allocator.activeSet();
>+        for (Register r = lsReg(active_set); active_set;
>+             r = lsReg(active_set &= ~rmask(r)))

I don't like having to write the masking code in every loop.  I'd prefer
something like this:

        for (Register r = lsReg(active_set); active_set;
             r = nextLsReg(active_set, r))

where nextLsReg() is suitably defined.


>+            AvmAssert(ins != 0);

Nit: Use NanoAssert(), please, TM doesn't define AvmAssert().


>     class AR
>     {
>     private:
>         uint32_t        _highWaterMark;                 /* index of highest entry used since last clear() */
>+        uint32_t        _used[NJ_MAX_STACK_ENTRY/(8*sizeof(uint32_t))];
>+#if defined(_DEBUG)
>         LIns*           _entries[ NJ_MAX_STACK_ENTRY ]; /* maps to 4B contiguous locations relative to the frame pointer.
>                                                             NB: _entries[0] is always unused */
>+#endif

I'd like a comment explaining _used, that it's a bitset implemented using
32-bit values.  I'd also like a comment explaining the connection between
_used and _entries;  and since they are so closely linked their names should
be more similar, maybe _usedBits and _usedInstrs?


>+    // return the lowest numbered Register in mask
>+    inline Register lsReg(RegisterMask set) {
>+    #if defined _MSC_VER
>+        DWORD tr;
>+        _BitScanForward(&tr, set);
>+        return (int) tr;
>+    #else
>+        return Register(__builtin_ffs(set) - 1);

I think this is better:

  return Register(__builtin_ctz(set | 1))

The |1 is because the result is undefined if the operand is zero.

>+    #endif
>+    }
>+
>+    // return the highest numbered Register in mask
>+    inline Register msReg(RegisterMask set) {
>+    #if defined _MSC_VER
>+        DWORD tr;
>+        _BitScanReverse(&tr, set);
>+        return (int) tr;
>+    #else
>+        return Register(31 - __builtin_clz(set));

Similarly, you probably want (set | 0x80000000) here, because
__builtin_clz's result is undefined if the argument is zero.

>+    #endif
>+    }

Several problems here:

- You're assuming either MSVC or GCC;  we should have a fallback for other
  compilers, like is done in msbSet().

- This should not be in a backend, but in LIR.h or somewhere else central
  like that.

- I'd prefer to see the basic intrinsics defined using all the #ifdefs and
  compiler-specific stuff, and then the domain-specific stuff (eg. lsReg())
  layered on top of that, like I did with compressAccSet().  The PPC and
  MIPS backends use 64 bits worth of registers, so we'll need lsbSet32(),
  msbSet32(), lsbSet64(), msbSet64(); the latter two aren't needed on all
  platforms though.


For TM, it reduces SunSpider instructions counts by a factor of 0.5%, which
is certaintly better than nothing;  for 3d-raytace it's 2.0%.  Actual timing
changes are below the noise threshold.
Attachment #463412 - Flags: feedback?(nnethercote) → feedback+
Assignee: nobody → edwsmith
Severity: normal → minor
OS: Mac OS X → Windows 7
Priority: -- → P3
Target Milestone: --- → flash10.2
I would prefer an AR representation based on an ordered list of allocated regions. This would allow annotations to associated with the regions.  The current phi node prototype (see bug 575527) requires that a phi node be begin its lifetime in a register, meaning that a block may have no more phi nodes than allocatable registers of the proper class.  I have given some thought to what would be needed to relax this restriction, and find it desirable to be able to coalesce stack allocations, i.e., to have more than one LIns sharing a slot.  In this case, we would have to delay deallocation until all LIns holding the AR slot are dead, suggesting that a reference count be maintained for the AR reservation.

A more immediate issue with both the present scheme and a bitmap is that they don't handle large multi-word allocations well, e.g., stack-allocated arrays.  This may be adequate for the current use cases, but limits the applicability of NanoJIT.
Any AR changes will be in bug 584608, and if necessary, that bug could expand in scope (or factor) to accomodate phis, N-word allocations, faster lookups, and a denser data structure.  I did some prototyping on a bitmap based AR allocation; no speedup, for twice the complexity, and that was just the 1-word allocation path.

It did prove to me, however, that the only use for AR.entries actually tracking specific instructions, is in verbose mode.  Given our current design, we dont need AR->LIns* edges.
Duplicate of this bug: 514274
Blocks: 587735
Updates since the proof-of-concept:

Incorporated all the suggestions from Comment #1.

I renamed msbSet() to msbSet32() as part of adding [msb|lsb]Set[32|64], which affected the AccSet code trivially.

I used if (sizeof(RegisterMask) == 4) to choose between 32 and 64bit implementations, counting on a sane compiler to strip out the provably dead path. An alternative would be to move the definitions of lsReg() and msReg() to NativeXXX.h, after the RegisterMask typedef, allowing backends to hardcode the choice.  Given we have six backends and one more on the way, it seemed better to centralize the code and also avoid more ifdefs.

I moved the definitions of msbSet/lsbSet to nanojit.h, where other such helpers already live.  It didn't seem appropriate to keep adding to LIR.h since the helpers will now be used in several places in nanojit.

I have *not* scoured the backends to see if we could add more specialized msbSet implementations.  We could do that in Bug 587735, or here; comments welcome.

In particular, CountLeadingZeros() in NativeARM.cpp has support for MSVC and ARMCC.

RegAlloc::managed is now set in Assembler.cpp instead of each backend; six lines of code replaced by one.

prevreg() was dead after these changes.  Additionally, I hand-inlined nextreg() in the other backends, because the usage was highly specialized -- those call sites depended on nextreg being reg+1, (or reg+2) not some generic iteration.

I removed RegAlloc::countActive() since the only case was testing countActive() == 0, which is equivalent to activeMask() == 0.

I stripped the AR changes out of this patch.  See bug 584608.

evictActiveRegs() and evictSomeActiveRegs() are now nearly identical.  I didn't combine anything, under the assumption these have been separated as a jit-performance tweak.  But, I could merge them or factor them pretty easily.

micro-optimization: nearly everywhere we call evict(), we already have reg and ins handy, so we could streamline this slightly by passing both as parameters, instead of just ins.
Attachment #463412 - Attachment is obsolete: true
Attachment #466392 - Flags: review?(nnethercote)
Attachment #466392 - Flags: feedback?
Whiteboard: has-patch, PACMAN
Attachment #466392 - Flags: feedback?
Comment on attachment 466392 [details] [diff] [review]
use lsReg/msReg() for register iteration and remove nextreg/prevreg

This is a lovely patch.  I particularly like that it removes lots of
conditional tests -- this makes the code simpler (less nesting) and also
those tests aren't easily predictable.



>@@ -2311,20 +2308,18 @@
>     {
>         // generate code to restore callee saved registers
>         // @todo speed this up
>-        for (Register r = FirstReg; r <= LastReg; r = nextreg(r)) {
>-            evictIfActive(r);
>-        }
>+        RegisterMask evict_set = _allocator.activeMask();
>+        for (Register r = lsReg(evict_set); evict_set; r = nextLsReg(evict_set, r))
>+            evict(_allocator.getActive(r));
>     }
> 
>     void Assembler::evictSomeActiveRegs(RegisterMask regs)
>     {
>         // generate code to restore callee saved registers
>         // @todo speed this up
>-        for (Register r = FirstReg; r <= LastReg; r = nextreg(r)) {
>-            if ((rmask(r) & regs)) {
>-                evictIfActive(r);
>-            }
>-        }
>+        RegisterMask evict_set = regs & _allocator.activeMask();
>+        for (Register r = lsReg(evict_set); evict_set; r = nextLsReg(evict_set, r))
>+            evict(_allocator.getActive(r));
>     }

This can be improved, as you mentioned -- I'd change evictActiveRegs() to
just call evictSomeActiveRegs(managed).  And swap the order of the functions
to make it more likely inlining will occur :)



>+    // return the lowest numbered Register in mask
>+    inline Register lsReg(RegisterMask mask) {
>+        if (sizeof(RegisterMask) == 4)
>+            return (Register) lsbSet32(mask);
>+        else
>+            return (Register) lsbSet64(mask);
>+    }

I really hope the compiler evaluates that test at compile-time!  It should,
though.  Maybe add a comment, ie. that this is faster than it looks?


>+namespace nanojit
>+{
>+// Define msbSet32(), lsbSet32(), msbSet64(), and lsbSet64() functions using
>+// fast find-first-bit instructions intrinsics when available.
>+// The default implementations use iteration.

Nit: "default" makes it sound like that'll often be used;  I'd change it to
"fall-back".


I'm getting an assertion failure on i386.  It occurs in this code:

        for (Register r = lsReg(not_managed); not_managed; r = nextLsReg(not_man
aged, r)) {
            // A register not managed by register allocation must be
            // neither free nor active. 
            NanoAssert(!_allocator.isFree(r));
            NanoAssert(!_allocator.getActive(r));
        }

On i386, LastReg=16 but this iterates through registers 17..31.  17 happens to
be deprecated_UnknownReg, and isFree() asserts if it is passed that.  It
doesn't happen on X64 because LastReg=32 which is outside the 0..31 range
iterated over by this loop.  Here's my proposed fix:

        for (Register r = lsReg(not_managed); not_managed; r = nextLsReg(not_man
aged, r)) {
            // A register not managed by register allocation must be
            // neither free nor active. 
            if (r <= LastReg) {
                NanoAssert(!_allocator.isFree(r));
                NanoAssert(!_allocator.getActive(r));
            }
        }

...you should check all the back-ends to make sure this is valid.  I found
LastReg on ARM and Sparc to be confusing, because there were other registers
named that had higher numbers;  hopefully you understand what's happening
there.

r=me with the above things fixed, no need to resubmit for review.
Attachment #466392 - Flags: review?(nnethercote) → review+
NJ: http://hg.mozilla.org/projects/nanojit-central/rev/c7009f5cd83e

The submitted code incorporates the suggestions above.  I moved evictAllActiveRegs() to Assembler.h as a one-line, easy-to-inline function.  I reproduced the assert on i386, fixed it, checked the other backends.  In particular, FirstReg == 0 on all platforms.

I also fixed two typos in the MSVC versions of msbSet64() and lsbSet64().
Whiteboard: has-patch, PACMAN → PACMAN, fixed-in-nanojit
Blocks: 583955
http://hg.mozilla.org/mozilla-central/rev/736c6cd53541
Status: NEW → RESOLVED
Closed: 9 years ago
Resolution: --- → FIXED
Product: Core → Core Graveyard
You need to log in before you can comment on or make changes to this bug.