584935 - Speed up register iteration loops by using BSF instruction

Assignee

Description

•

15 years ago

Attached patch proof of concept on x64 (obsolete) — Details — Splinter Review

On x86 or x64, we can use BSF to skip regions of 0s in a bitmask, to quickly iterate through register masks that satisfy various criteria. For example: int active = ~free & managed for (int r=bsf(active); active != 0; i = bsf(active &= ~(1<<r))) /* do stuff with _allocator.active[r] */ The same principle ought to work in bug 584608 for quickly finding free AR entries, although I haven't tried that yet. Preliminary benchmarking on tamarin-redux's jit-only mode (dont run code, just feed a large input set through the JIT) shows about a 3-5% speedup in TR's verify+jit time. That time would be diluted in the real world, of course. The attached proof of concept also includes the bitmask-based AR from bug 584608.

Edwin Smith

Assignee

Updated

•

15 years ago

Attachment #463412 - Attachment is patch: true

Attachment #463412 - Attachment mime type: application/octet-stream → text/plain

Edwin Smith

Assignee

Updated

•

15 years ago

Attachment #463412 - Flags: feedback?(nnethercote)

Nicholas Nethercote [inactive]

Comment 1

•

15 years ago

Comment on attachment 463412 [details] [diff] [review] proof of concept on x64 I definitely like the basic idea of the patch, hence the f+. But I have a few things I'd like changed. First, I'd like to see nextreg()/prevreg() completely gone, I'd like only one way to iterate over a loop. Remaining uses: - RegAlloc::countActive() - Assembler::registerConsistencyCheck() - Assembler::printRegState() - Assembler::intersectRegisterState() -- in a comment; just delete the whole comment line beginning with "viz." >+ int active_set = _allocator.activeSet(); >+ for (Register r = lsReg(active_set); active_set; >+ r = lsReg(active_set &= ~rmask(r))) I don't like having to write the masking code in every loop. I'd prefer something like this: for (Register r = lsReg(active_set); active_set; r = nextLsReg(active_set, r)) where nextLsReg() is suitably defined. >+ AvmAssert(ins != 0); Nit: Use NanoAssert(), please, TM doesn't define AvmAssert(). > class AR > { > private: > uint32_t _highWaterMark; /* index of highest entry used since last clear() */ >+ uint32_t _used[NJ_MAX_STACK_ENTRY/(8*sizeof(uint32_t))]; >+#if defined(_DEBUG) > LIns* _entries[ NJ_MAX_STACK_ENTRY ]; /* maps to 4B contiguous locations relative to the frame pointer. > NB: _entries[0] is always unused */ >+#endif I'd like a comment explaining _used, that it's a bitset implemented using 32-bit values. I'd also like a comment explaining the connection between _used and _entries; and since they are so closely linked their names should be more similar, maybe _usedBits and _usedInstrs? >+ // return the lowest numbered Register in mask >+ inline Register lsReg(RegisterMask set) { >+ #if defined _MSC_VER >+ DWORD tr; >+ _BitScanForward(&tr, set); >+ return (int) tr; >+ #else >+ return Register(__builtin_ffs(set) - 1); I think this is better: return Register(__builtin_ctz(set | 1)) The |1 is because the result is undefined if the operand is zero. >+ #endif >+ } >+ >+ // return the highest numbered Register in mask >+ inline Register msReg(RegisterMask set) { >+ #if defined _MSC_VER >+ DWORD tr; >+ _BitScanReverse(&tr, set); >+ return (int) tr; >+ #else >+ return Register(31 - __builtin_clz(set)); Similarly, you probably want (set | 0x80000000) here, because __builtin_clz's result is undefined if the argument is zero. >+ #endif >+ } Several problems here: - You're assuming either MSVC or GCC; we should have a fallback for other compilers, like is done in msbSet(). - This should not be in a backend, but in LIR.h or somewhere else central like that. - I'd prefer to see the basic intrinsics defined using all the #ifdefs and compiler-specific stuff, and then the domain-specific stuff (eg. lsReg()) layered on top of that, like I did with compressAccSet(). The PPC and MIPS backends use 64 bits worth of registers, so we'll need lsbSet32(), msbSet32(), lsbSet64(), msbSet64(); the latter two aren't needed on all platforms though. For TM, it reduces SunSpider instructions counts by a factor of 0.5%, which is certaintly better than nothing; for 3d-raytace it's 2.0%. Actual timing changes are below the noise threshold.

Attachment #463412 - Flags: feedback?(nnethercote) → feedback+

Edwin Smith

Assignee

Updated

•

15 years ago

Assignee: nobody → edwsmith

Edwin Smith

Assignee

Updated

•

15 years ago

Severity: normal → minor

OS: Mac OS X → Windows 7

Priority: -- → P3

Target Milestone: --- → flash10.2

William Maddox

Comment 2

•

15 years ago

I would prefer an AR representation based on an ordered list of allocated regions. This would allow annotations to associated with the regions. The current phi node prototype (see bug 575527) requires that a phi node be begin its lifetime in a register, meaning that a block may have no more phi nodes than allocatable registers of the proper class. I have given some thought to what would be needed to relax this restriction, and find it desirable to be able to coalesce stack allocations, i.e., to have more than one LIns sharing a slot. In this case, we would have to delay deallocation until all LIns holding the AR slot are dead, suggesting that a reference count be maintained for the AR reservation. A more immediate issue with both the present scheme and a bitmap is that they don't handle large multi-word allocations well, e.g., stack-allocated arrays. This may be adequate for the current use cases, but limits the applicability of NanoJIT.

Edwin Smith

Assignee

Comment 3

•

15 years ago

Any AR changes will be in bug 584608, and if necessary, that bug could expand in scope (or factor) to accomodate phis, N-word allocations, faster lookups, and a denser data structure. I did some prototyping on a bitmap based AR allocation; no speedup, for twice the complexity, and that was just the 1-word allocation path. It did prove to me, however, that the only use for AR.entries actually tracking specific instructions, is in verbose mode. Given our current design, we dont need AR->LIns* edges.

Edwin Smith

Assignee

Updated

•

15 years ago

Blocks: 587735

Edwin Smith

Assignee

Comment 5

•

15 years ago

Attached patch use lsReg/msReg() for register iteration and remove nextreg/prevreg — Details — Splinter Review

Updates since the proof-of-concept: Incorporated all the suggestions from Comment #1. I renamed msbSet() to msbSet32() as part of adding [msb|lsb]Set[32|64], which affected the AccSet code trivially. I used if (sizeof(RegisterMask) == 4) to choose between 32 and 64bit implementations, counting on a sane compiler to strip out the provably dead path. An alternative would be to move the definitions of lsReg() and msReg() to NativeXXX.h, after the RegisterMask typedef, allowing backends to hardcode the choice. Given we have six backends and one more on the way, it seemed better to centralize the code and also avoid more ifdefs. I moved the definitions of msbSet/lsbSet to nanojit.h, where other such helpers already live. It didn't seem appropriate to keep adding to LIR.h since the helpers will now be used in several places in nanojit. I have *not* scoured the backends to see if we could add more specialized msbSet implementations. We could do that in Bug 587735, or here; comments welcome. In particular, CountLeadingZeros() in NativeARM.cpp has support for MSVC and ARMCC. RegAlloc::managed is now set in Assembler.cpp instead of each backend; six lines of code replaced by one. prevreg() was dead after these changes. Additionally, I hand-inlined nextreg() in the other backends, because the usage was highly specialized -- those call sites depended on nextreg being reg+1, (or reg+2) not some generic iteration. I removed RegAlloc::countActive() since the only case was testing countActive() == 0, which is equivalent to activeMask() == 0. I stripped the AR changes out of this patch. See bug 584608. evictActiveRegs() and evictSomeActiveRegs() are now nearly identical. I didn't combine anything, under the assumption these have been separated as a jit-performance tweak. But, I could merge them or factor them pretty easily. micro-optimization: nearly everywhere we call evict(), we already have reg and ins handy, so we could streamline this slightly by passing both as parameters, instead of just ins.

Attachment #463412 - Attachment is obsolete: true

Attachment #466392 - Flags: review?(nnethercote)

Attachment #466392 - Flags: feedback?

Edwin Smith

Assignee

Updated

•

15 years ago

Whiteboard: has-patch, PACMAN

Edwin Smith

Assignee

Updated

•

15 years ago

Attachment #466392 - Flags: feedback?

Nicholas Nethercote [inactive]

Comment 6

•

15 years ago

Comment on attachment 466392 [details] [diff] [review] use lsReg/msReg() for register iteration and remove nextreg/prevreg This is a lovely patch. I particularly like that it removes lots of conditional tests -- this makes the code simpler (less nesting) and also those tests aren't easily predictable. >@@ -2311,20 +2308,18 @@ > { > // generate code to restore callee saved registers > // @todo speed this up >- for (Register r = FirstReg; r <= LastReg; r = nextreg(r)) { >- evictIfActive(r); >- } >+ RegisterMask evict_set = _allocator.activeMask(); >+ for (Register r = lsReg(evict_set); evict_set; r = nextLsReg(evict_set, r)) >+ evict(_allocator.getActive(r)); > } > > void Assembler::evictSomeActiveRegs(RegisterMask regs) > { > // generate code to restore callee saved registers > // @todo speed this up >- for (Register r = FirstReg; r <= LastReg; r = nextreg(r)) { >- if ((rmask(r) & regs)) { >- evictIfActive(r); >- } >- } >+ RegisterMask evict_set = regs & _allocator.activeMask(); >+ for (Register r = lsReg(evict_set); evict_set; r = nextLsReg(evict_set, r)) >+ evict(_allocator.getActive(r)); > } This can be improved, as you mentioned -- I'd change evictActiveRegs() to just call evictSomeActiveRegs(managed). And swap the order of the functions to make it more likely inlining will occur :) >+ // return the lowest numbered Register in mask >+ inline Register lsReg(RegisterMask mask) { >+ if (sizeof(RegisterMask) == 4) >+ return (Register) lsbSet32(mask); >+ else >+ return (Register) lsbSet64(mask); >+ } I really hope the compiler evaluates that test at compile-time! It should, though. Maybe add a comment, ie. that this is faster than it looks? >+namespace nanojit >+{ >+// Define msbSet32(), lsbSet32(), msbSet64(), and lsbSet64() functions using >+// fast find-first-bit instructions intrinsics when available. >+// The default implementations use iteration. Nit: "default" makes it sound like that'll often be used; I'd change it to "fall-back". I'm getting an assertion failure on i386. It occurs in this code: for (Register r = lsReg(not_managed); not_managed; r = nextLsReg(not_man aged, r)) { // A register not managed by register allocation must be // neither free nor active. NanoAssert(!_allocator.isFree(r)); NanoAssert(!_allocator.getActive(r)); } On i386, LastReg=16 but this iterates through registers 17..31. 17 happens to be deprecated_UnknownReg, and isFree() asserts if it is passed that. It doesn't happen on X64 because LastReg=32 which is outside the 0..31 range iterated over by this loop. Here's my proposed fix: for (Register r = lsReg(not_managed); not_managed; r = nextLsReg(not_man aged, r)) { // A register not managed by register allocation must be // neither free nor active. if (r <= LastReg) { NanoAssert(!_allocator.isFree(r)); NanoAssert(!_allocator.getActive(r)); } } ...you should check all the back-ends to make sure this is valid. I found LastReg on ARM and Sparc to be confusing, because there were other registers named that had higher numbers; hopefully you understand what's happening there. r=me with the above things fixed, no need to resubmit for review.

Attachment #466392 - Flags: review?(nnethercote) → review+

Edwin Smith

Assignee

Comment 7

•

15 years ago

NJ: http://hg.mozilla.org/projects/nanojit-central/rev/c7009f5cd83e The submitted code incorporates the suggestions above. I moved evictAllActiveRegs() to Assembler.h as a one-line, easy-to-inline function. I reproduced the assert on i386, fixed it, checked the other backends. In particular, FirstReg == 0 on all platforms. I also fixed two typos in the MSVC versions of msbSet64() and lsbSet64().

Edwin Smith

Assignee

Updated

•

15 years ago

Whiteboard: has-patch, PACMAN → PACMAN, fixed-in-nanojit

Edwin Smith

Assignee

Updated

•

15 years ago

Blocks: 583955

Robert Sayre

Comment 8

•

15 years ago

http://hg.mozilla.org/mozilla-central/rev/736c6cd53541

Status: NEW → RESOLVED

Closed: 15 years ago

Resolution: --- → FIXED

Nobody; OK to take it and work on it

Updated

•

11 years ago

Product: Core → Core Graveyard

proof of concept on x64 15 years ago Edwin Smith 20.13 KB, patch	n.nethercote : feedback+	Details \| Diff \| Splinter Review
use lsReg/msReg() for register iteration and remove nextreg/prevreg 15 years ago Edwin Smith 27.03 KB, patch	n.nethercote : review+	Details \| Diff \| Splinter Review

Bugzilla

Speed up register iteration loops by using BSF instruction

Categories

(Core Graveyard :: Nanojit, defect, P3)

Tracking

(Not tracked)

People

(Reporter: edwsmith, Assigned: edwsmith)

References

Details

(Whiteboard: PACMAN, fixed-in-nanojit)

Crash Data

Security

(public)

User Story

Attachments

(1 file, 1 obsolete file)

Description

Updated

Updated

Comment 1

Updated

Updated

Comment 2

Comment 3

Updated

Comment 5

Updated

Updated

Comment 6

Comment 7

Updated

Updated

Comment 8

Updated

Attachment

General

Description

File Name

Content Type