Closed
Bug 584275
Opened 14 years ago
Closed 14 years ago
nanojit: preparation for adding many more access regions
Categories
(Core Graveyard :: Nanojit, defect)
Tracking
(Not tracked)
RESOLVED
FIXED
People
(Reporter: n.nethercote, Assigned: n.nethercote)
References
Details
(Whiteboard: fixed-in-nanojit, fixed-in-tracemonkey, fixed-in-tamarin)
Attachments
(4 files)
9.06 KB, patch (edwsmith: review+)
4.51 KB, patch (gal: review+)
3.23 KB, patch (edwsmith: review+)
888 bytes, patch (gal: review+)
This patch adds some optimisations to avoid slow-downs caused by using more
access regions.
- In CseFilter, only clear a hash table if it is non-empty. This avoids a
good number of memset() calls.
- Add a function msbSet() which uses machine instructions to find the
most-significant set bit in a uint32_t. I haven't yet tested msbSet() on
Windows, but it's cribbed from jsbit.h in TM, so hopefully it'll work. I'll
run it through the try server shortly.
- Use msbSet() in CseFilter::insLoad() to optimise the invalidation loop, so
there's only one iteration for each bit set in storesSinceLastLoad. Also
use it to speed up compressAccSet(), along with the new isSingletonSet()
function.
Also:
- Change the way embedder-specific values are passed into ValidateWriter --
use an array, so any number of values can be passed in.
- Add 'disp' to checkAccSet().
Attachment #462670 -
Flags: review?(edwsmith)
Assignee
Comment 1•14 years ago
Update TM for the NJ patches.
Attachment #462671 -
Flags: review?(gal)
Updated•14 years ago
Attachment #462671 -
Flags: review?(gal) → review+
Comment 2•14 years ago
cool, msbSet()'s active ingredient is the same instruction as nRegisterAllocFromSet (more or less; bsf vs bsr). I also noticed this patch's use of msbSet() would work equally well on architectures that only have an "lsb" equivalent instruction. See NativeARM.cpp for more crib-able code for ARM.
It occurs to me we could also use this to speed up iterating through various register sets. A common pattern is:
for (Register r = FirstReg; r <= LastReg; r = nextreg(r))
if (/* mask expression of rmask(r) */ && something)
..
it could be replaced by:
RegisterMask mask = /* equivalent mask expression */
for (Register r = msbSet(mask); mask != 0; r = msbSet(mask))
if (/* something */)
...
mask &= ~rmask(r)
I remember a while back you were looking at speeding up those iteration loops but I don't know if you tried something based on the BSR/BSF (or equivalent) machine instruction.
Comment 3•14 years ago
Comment on attachment 462670 [details] [diff] [review]
NJ patch (against 48641:8cae6d5cd69c)
Looks correct. It would be good to factor out the bit-scanning code from each backend, but it's not something that must be done on the critical path.
Attachment #462670 -
Flags: review?(edwsmith) → review+
Assignee
Comment 4•14 years ago
(In reply to comment #2)
>
> It occurs to me we could also use this to speed up iterating through various
> register sets. A common pattern is:
>
> for (Register r = FirstReg; r <= LastReg; r = nextreg(r))
> if (/* mask expression of rmask(r) */ && something)
> ..
>
> it could be replaced by:
> RegisterMask mask = /* equivalent mask expression */
> for (Register r = msbSet(mask); mask != 0; r = msbSet())
> if (/* something */)
> ...
> mask &= ~rmask(r)
>
> I remember a while back you were looking at speeding up those iteration loops
> but i dont know if you tried something based on the BSR/BSF (or equivalent)
> machine instruction.
I didn't, but it's a good idea for the ones where the pattern applies. There are still several of the form:
for (Register r = FirstReg; r <= LastReg; r = nextreg(r)) {
    LIns* ins = _allocator.getActive(r);
    if (ins) {
        ...
    }
}
Is there a fast way to do the same thing at the level of words? I guess we could have a bit-set alongside 'active' to enable bitscanning, though that's less ideal.
Assignee
Comment 5•14 years ago
Attachment #462941 -
Flags: review?(edwsmith)
Comment 6•14 years ago
(In reply to comment #4)
> Is there a fast way to do the same thing at the level of words? I guess we
> could have a bit-set alongside 'active' to enable bitscanning, though that's
> less ideal.
two ideas:
1. active = ~free & managed, then use bsr/bsf. One alu op to compute
the active bitmask (free is managed in RegAlloc, and "managed" is
constant). the loop in unionRegisterState would look like this:
mask = (~current.free | ~saved.free) & managed
for (... bsf-iteration of mask)
2. replace active[] with a Briggs/Torczon style sparse set
http://citeseerx.ist.psu.edu/viewdoc/summary?doi=10.1.1.30.7319
The sparse set maintains two arrays of [MaxReg]: sparse[] and dense[].
sparse[] is indexed by register, like _active is, but contains an index
in dense. add, get, and delete are O(1), and iteration is O(N) where N
is the # of active entries. (vs a bitset which is O(U)).
Using BSF/BSR on a 1-word bitmask ought to be just as fast though, since
it could skip 1..32 bits in 1 machine cycle, making iteration O(N)
instead of O(U). the sparse set starts to beat the bitmap when U/32 >> N,
ie when the set is *so* big and sparse that we start losing the benefit
of the 1-cycle BSF instruction and 32-bit parallel ALU ops.
Oh, and idea #3: use SSE instructions to slurp 2 or 4 words at a time,
or more. This doesn't seem worth the bother.
Updated•14 years ago
Attachment #462941 -
Flags: review?(edwsmith) → review+
Comment 7•14 years ago
I prototyped idea #1 in Bug 584935; it's showing about a 5% speedup in pure jit speed (large input set of 1000s of methods, no jit-code execution) on x64. Expect that % to be diluted in the real world, of course.
Comment 8•14 years ago
Is this fixed?
http://hg.mozilla.org/mozilla-central/rev/a99f99f9f614
Assignee
Comment 9•14 years ago
It's fixed in NJ and TM, sorry I forgot the whiteboard+links:
http://hg.mozilla.org/projects/nanojit-central/rev/64ae262a0203
http://hg.mozilla.org/tracemonkey/rev/a99f99f9f614
Whiteboard: fixed-in-nanojit, fixed-in-tracemonkey
Comment 10•14 years ago
This snuck in because TR's implementation of Allocator::allocChunk() calls malloc(), while IIRC TM's calls calloc().
Attachment #464461 -
Flags: review?(gal)
Updated•14 years ago
Attachment #464461 -
Flags: review?(gal) → review+
Comment 11•14 years ago
nj patches
http://hg.mozilla.org/tamarin-redux/rev/5f0eb4431a4c
http://hg.mozilla.org/tamarin-redux/rev/053e59c92312
http://hg.mozilla.org/tamarin-redux/rev/355dd7b3ec01
tr patch
http://hg.mozilla.org/tamarin-redux/rev/4530cb245a32
Whiteboard: fixed-in-nanojit, fixed-in-tracemonkey → fixed-in-nanojit, fixed-in-tracemonkey, fixed-in-tamarin
Assignee
Updated•14 years ago
Status: ASSIGNED → RESOLVED
Closed: 14 years ago
Resolution: --- → FIXED
Updated•11 years ago
Product: Core → Core Graveyard