Bug 584935 (Closed)
Opened 15 years ago · Closed 15 years ago
Speed up register iteration loops by using BSF instruction
Categories: Core Graveyard :: Nanojit, defect, P3
Tracking: (Not tracked)
Status: RESOLVED FIXED
Target Milestone: flash10.2
People: Reporter: edwsmith; Assigned: edwsmith
Whiteboard: PACMAN, fixed-in-nanojit
Attachments: (1 file, 1 obsolete file)
patch, 27.03 KB, n.nethercote: review+
On x86 or x64, we can use BSF to skip regions of 0s in a bitmask, to quickly iterate through register masks that satisfy various criteria. For example:
int active = ~free & managed;
for (int r = bsf(active); active != 0; r = bsf(active &= ~(1 << r)))
    /* do stuff with _allocator.active[r] */
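For illustration, here is a self-contained sketch of the same pattern, assuming GCC/Clang and using __builtin_ctz as the BSF intrinsic; bsf() and the masks are illustrative, not Nanojit code, and the loop is arranged so bsf() is never applied to 0 (for which its result is undefined):

    #include <cstdint>
    #include <cstdio>

    // Illustrative stand-in for the BSF instruction: index of the lowest
    // set bit. Undefined for 0, so callers check the mask first.
    static inline int bsf(uint32_t mask) { return __builtin_ctz(mask); }

    int main() {
        uint32_t free_regs = 0x0000FF0F;  // hypothetical register masks
        uint32_t managed   = 0x0000FFFF;
        uint32_t active    = ~free_regs & managed;
        while (active != 0) {
            int r = bsf(active);          // jump straight to the next set bit
            printf("register %d is active\n", r);
            active &= ~(1u << r);         // clear the bit just visited
        }
        return 0;
    }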
The same principle ought to work in bug 584608 for quickly finding free AR entries, although I haven't tried that yet.
Preliminary benchmarking on tamarin-redux's jit-only mode (don't run code, just feed a large input set through the JIT) shows about a 3-5% speedup in TR's verify+jit time. That gain would be diluted in the real world, of course.
The attached proof of concept also includes the bitmask-based AR from bug 584608.
Assignee | Updated•15 years ago
Attachment #463412 - Attachment is patch: true
Attachment #463412 - Attachment mime type: application/octet-stream → text/plain

Assignee | Updated•15 years ago
Attachment #463412 - Flags: feedback?(nnethercote)
Comment 1•15 years ago
Comment on attachment 463412 [details] [diff] [review]
proof of concept on x64
I definitely like the basic idea of the patch, hence the f+. But I have a
few things I'd like changed.
First, I'd like to see nextreg()/prevreg() completely gone; I'd like only
one way to iterate over the registers. Remaining uses:
- RegAlloc::countActive()
- Assembler::registerConsistencyCheck()
- Assembler::printRegState()
- Assembler::intersectRegisterState() -- in a comment; just delete the
whole comment line beginning with "viz."
>+ int active_set = _allocator.activeSet();
>+ for (Register r = lsReg(active_set); active_set;
>+ r = lsReg(active_set &= ~rmask(r)))
I don't like having to write the masking code in every loop. I'd prefer
something like this:
for (Register r = lsReg(active_set); active_set;
r = nextLsReg(active_set, r))
where nextLsReg() is suitably defined.
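One plausible definition, as a sketch (the committed helper may differ in detail):

    // Clear the bit for the register just visited, then rescan for the
    // next lowest set bit. When `set` reaches 0 the loop condition exits
    // before the scan result is ever used, so lsReg(0) is never consumed.
    inline Register nextLsReg(RegisterMask &set, Register r) {
        set &= ~rmask(r);
        return lsReg(set);
    }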
>+ AvmAssert(ins != 0);
Nit: Use NanoAssert(), please, TM doesn't define AvmAssert().
> class AR
> {
> private:
> uint32_t _highWaterMark; /* index of highest entry used since last clear() */
>+ uint32_t _used[NJ_MAX_STACK_ENTRY/(8*sizeof(uint32_t))];
>+#if defined(_DEBUG)
> LIns* _entries[ NJ_MAX_STACK_ENTRY ]; /* maps to 4B contiguous locations relative to the frame pointer.
> NB: _entries[0] is always unused */
>+#endif
I'd like a comment explaining _used, that it's a bitset implemented using
32-bit values. I'd also like a comment explaining the connection between
_used and _entries; and since they are so closely linked their names should
be more similar, maybe _usedBits and _usedInstrs?
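For reference, the usual shape of such a bitset, with assumed names (not taken from the patch):

    // Bit (i % 32) of _used[i / 32] is set iff stack entry i is allocated.
    bool isUsed(uint32_t i) const   { return (_used[i >> 5] >> (i & 31)) & 1; }
    void markUsed(uint32_t i)       { _used[i >> 5] |=  (1u << (i & 31)); }
    void markUnused(uint32_t i)     { _used[i >> 5] &= ~(1u << (i & 31)); }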
>+ // return the lowest numbered Register in mask
>+ inline Register lsReg(RegisterMask set) {
>+ #if defined _MSC_VER
>+ DWORD tr;
>+ _BitScanForward(&tr, set);
>+ return (int) tr;
>+ #else
>+ return Register(__builtin_ffs(set) - 1);
I think this is better:
return Register(__builtin_ctz(set | 1))
The |1 is because the result is undefined if the operand is zero.
>+ #endif
>+ }
>+
>+ // return the highest numbered Register in mask
>+ inline Register msReg(RegisterMask set) {
>+ #if defined _MSC_VER
>+ DWORD tr;
>+ _BitScanReverse(&tr, set);
>+ return (int) tr;
>+ #else
>+ return Register(31 - __builtin_clz(set));
Similarly, you probably want (set | 0x80000000) here, because
__builtin_clz's result is undefined if the argument is zero.
>+ #endif
>+ }
Several problems here:
- You're assuming either MSVC or GCC; we should have a fallback for other
compilers, as is done in msbSet().
- This should not be in a backend, but in LIR.h or somewhere else central
like that.
- I'd prefer to see the basic intrinsics defined using all the #ifdefs and
compiler-specific stuff, and then the domain-specific stuff (eg. lsReg())
layered on top of that, like I did with compressAccSet(). The PPC and
MIPS backends use 64 bits worth of registers, so we'll need lsbSet32(),
msbSet32(), lsbSet64(), msbSet64(); the latter two aren't needed on all
platforms though.
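The suggested layering might look roughly like this (a sketch only; the actual patch follows the existing msbSet()/compressAccSet() conventions, and Register/RegisterMask are Nanojit's types):

    #include <stdint.h>
    #if defined(_MSC_VER)
    #include <intrin.h>
    #endif

    // Layer 1: raw bit-scan intrinsic, all compiler #ifdefs in one place.
    static inline uint32_t lsbSet32(uint32_t x) {
    #if defined(_MSC_VER)
        unsigned long i;
        _BitScanForward(&i, x);
        return i;
    #elif defined(__GNUC__)
        return __builtin_ctz(x | 1);      // |1 because ctz(0) is undefined
    #else
        // Fall-back for other compilers: plain iteration.
        uint32_t i = 0;
        while (i < 31 && !(x & (1u << i)))
            i++;
        return i;
    #endif
    }

    // Layer 2: the domain-specific helper, free of compiler conditionals.
    inline Register lsReg(RegisterMask mask) {
        return Register(lsbSet32(mask));
    }

msbSet32(), lsbSet64(), and msbSet64() would follow the same two-layer pattern.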
For TM, it reduces SunSpider instruction counts by 0.5%, which
is certainly better than nothing; for 3d-raytrace it's 2.0%. Actual timing
changes are below the noise threshold.
Attachment #463412 - Flags: feedback?(nnethercote) → feedback+
Assignee | Updated•15 years ago
Assignee: nobody → edwsmith
Assignee | Updated•15 years ago
Severity: normal → minor
OS: Mac OS X → Windows 7
Priority: -- → P3
Target Milestone: --- → flash10.2
Comment 2•15 years ago
I would prefer an AR representation based on an ordered list of allocated regions. This would allow annotations to be associated with the regions. The current phi node prototype (see bug 575527) requires that a phi node begin its lifetime in a register, meaning that a block may have no more phi nodes than allocatable registers of the proper class. I have given some thought to what would be needed to relax this restriction, and find it desirable to be able to coalesce stack allocations, i.e., to have more than one LIns sharing a slot. In that case, we would have to delay deallocation until all LIns holding the AR slot are dead, suggesting that a reference count be maintained for the AR reservation.
A more immediate issue with both the present scheme and a bitmap is that they don't handle large multi-word allocations well, e.g., stack-allocated arrays. This may be adequate for the current use cases, but limits the applicability of NanoJIT.
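A rough sketch of the region-list idea (entirely speculative; none of these names exist in Nanojit):

    #include <stdint.h>
    #include <vector>

    // Speculative: an AR reservation as a contiguous run of 4-byte slots,
    // reference-counted so several LIns can share it (coalesced phis), and
    // sized in words so multi-word allocations such as stack arrays fit.
    struct ARRegion {
        uint32_t start;     // first stack slot, relative to the frame pointer
        uint32_t nwords;    // length in 4-byte words
        uint32_t refcount;  // the region is freed only when this reaches zero
    };

    // Kept ordered by `start`, so neighbours and free gaps are found with a
    // single in-order walk, and annotations can ride along in ARRegion.
    std::vector<ARRegion> regions;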
Assignee | Comment 3•15 years ago
Any AR changes will be in bug 584608, and if necessary that bug could expand in scope (or be factored) to accommodate phis, N-word allocations, faster lookups, and a denser data structure. I did some prototyping of a bitmap-based AR allocation: no speedup, for twice the complexity, and that was just the 1-word allocation path.
It did prove to me, however, that the only use for AR.entries actually tracking specific instructions is in verbose mode. Given our current design, we don't need AR->LIns* edges.
Assignee | Comment 5•15 years ago
Updates since the proof-of-concept:
- Incorporated all the suggestions from comment #1.
- I renamed msbSet() to msbSet32() as part of adding [msb|lsb]Set[32|64], which affected the AccSet code trivially.
- I used if (sizeof(RegisterMask) == 4) to choose between the 32- and 64-bit implementations, counting on a sane compiler to strip out the provably dead path. An alternative would be to move the definitions of lsReg() and msReg() to NativeXXX.h, after the RegisterMask typedef, allowing backends to hardcode the choice. Given that we have six backends and one more on the way, it seemed better to centralize the code and also avoid more ifdefs.
- I moved the definitions of msbSet/lsbSet to nanojit.h, where other such helpers already live. It didn't seem appropriate to keep adding to LIR.h, since the helpers will now be used in several places in nanojit.
- I have *not* scoured the backends to see if we could add more specialized msbSet implementations. We could do that in bug 587735, or here; comments welcome. In particular, CountLeadingZeros() in NativeARM.cpp has support for MSVC and ARMCC.
- RegAlloc::managed is now set in Assembler.cpp instead of in each backend; six lines of code replaced by one.
- prevreg() was dead after these changes. Additionally, I hand-inlined nextreg() in the other backends, because the usage was highly specialized -- those call sites depended on nextreg being reg+1 (or reg+2), not some generic iteration.
- I removed RegAlloc::countActive(), since its only use was testing countActive() == 0, which is equivalent to activeMask() == 0.
- I stripped the AR changes out of this patch. See bug 584608.
- evictActiveRegs() and evictSomeActiveRegs() are now nearly identical. I didn't combine them, on the assumption that they were separated as a jit-performance tweak; but I could merge or factor them pretty easily.
- Micro-optimization: nearly everywhere we call evict(), we already have reg and ins handy, so we could streamline this slightly by passing both as parameters, instead of just ins.
Attachment #463412 - Attachment is obsolete: true
Attachment #466392 - Flags: review?(nnethercote)
Attachment #466392 - Flags: feedback?
Assignee | Updated•15 years ago
Whiteboard: has-patch, PACMAN

Assignee | Updated•15 years ago
Attachment #466392 - Flags: feedback?
Comment 6•15 years ago
Comment on attachment 466392 [details] [diff] [review]
use lsReg/msReg() for register iteration and remove nextreg/prevreg
This is a lovely patch. I particularly like that it removes lots of
conditional tests -- this makes the code simpler (less nesting) and also
those tests aren't easily predictable.
>@@ -2311,20 +2308,18 @@
> {
> // generate code to restore callee saved registers
> // @todo speed this up
>- for (Register r = FirstReg; r <= LastReg; r = nextreg(r)) {
>- evictIfActive(r);
>- }
>+ RegisterMask evict_set = _allocator.activeMask();
>+ for (Register r = lsReg(evict_set); evict_set; r = nextLsReg(evict_set, r))
>+ evict(_allocator.getActive(r));
> }
>
> void Assembler::evictSomeActiveRegs(RegisterMask regs)
> {
> // generate code to restore callee saved registers
> // @todo speed this up
>- for (Register r = FirstReg; r <= LastReg; r = nextreg(r)) {
>- if ((rmask(r) & regs)) {
>- evictIfActive(r);
>- }
>- }
>+ RegisterMask evict_set = regs & _allocator.activeMask();
>+ for (Register r = lsReg(evict_set); evict_set; r = nextLsReg(evict_set, r))
>+ evict(_allocator.getActive(r));
> }
This can be improved, as you mentioned -- I'd change evictActiveRegs() to
just call evictSomeActiveRegs(managed). And swap the order of the functions
to make it more likely inlining will occur :)
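That is, roughly (a sketch of the suggested shape, not the committed code):

    void Assembler::evictSomeActiveRegs(RegisterMask regs)
    {
        // Evict only the active registers that appear in `regs`.
        RegisterMask evict_set = regs & _allocator.activeMask();
        for (Register r = lsReg(evict_set); evict_set; r = nextLsReg(evict_set, r))
            evict(_allocator.getActive(r));
    }

    // Defined after (and delegating to) evictSomeActiveRegs() so that the
    // trivial call is easy for the compiler to inline.
    void Assembler::evictActiveRegs()
    {
        evictSomeActiveRegs(_allocator.managed);
    }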
>+ // return the lowest numbered Register in mask
>+ inline Register lsReg(RegisterMask mask) {
>+ if (sizeof(RegisterMask) == 4)
>+ return (Register) lsbSet32(mask);
>+ else
>+ return (Register) lsbSet64(mask);
>+ }
I really hope the compiler evaluates that test at compile time! It should,
though. Maybe add a comment, i.e. that this is faster than it looks?
>+namespace nanojit
>+{
>+// Define msbSet32(), lsbSet32(), msbSet64(), and lsbSet64() functions using
>+// fast find-first-bit instructions intrinsics when available.
>+// The default implementations use iteration.
Nit: "default" makes it sound like that'll often be used; I'd change it to
"fall-back".
I'm getting an assertion failure on i386. It occurs in this code:
    for (Register r = lsReg(not_managed); not_managed; r = nextLsReg(not_managed, r)) {
        // A register not managed by register allocation must be
        // neither free nor active.
        NanoAssert(!_allocator.isFree(r));
        NanoAssert(!_allocator.getActive(r));
    }
On i386, LastReg=16 but this iterates through registers 17..31. 17 happens to
be deprecated_UnknownReg, and isFree() asserts if it is passed that. It
doesn't happen on X64 because LastReg=32, which is outside the 0..31 range
iterated over by this loop. Here's my proposed fix:
    for (Register r = lsReg(not_managed); not_managed; r = nextLsReg(not_managed, r)) {
        // A register not managed by register allocation must be
        // neither free nor active.
        if (r <= LastReg) {
            NanoAssert(!_allocator.isFree(r));
            NanoAssert(!_allocator.getActive(r));
        }
    }
...you should check all the back-ends to make sure this is valid. I found
LastReg on ARM and Sparc confusing, because there are other named registers
with higher numbers; hopefully you understand what's happening there.
r=me with the above things fixed, no need to resubmit for review.
Attachment #466392 - Flags: review?(nnethercote) → review+
Assignee | Comment 7•15 years ago
NJ: http://hg.mozilla.org/projects/nanojit-central/rev/c7009f5cd83e
The submitted code incorporates the suggestions above. I moved evictAllActiveRegs() to Assembler.h as a one-line, easy-to-inline function. I reproduced the assert on i386, fixed it, and checked the other backends; in particular, FirstReg == 0 on all platforms.
I also fixed two typos in the MSVC versions of msbSet64() and lsbSet64().
Assignee | Updated•15 years ago
Whiteboard: has-patch, PACMAN → PACMAN, fixed-in-nanojit
Comment 8•15 years ago
Status: NEW → RESOLVED
Closed: 15 years ago
Resolution: --- → FIXED
Updated•11 years ago
Product: Core → Core Graveyard