Closed
Bug 584275
Opened 14 years ago
Closed 14 years ago
nanojit: preparation for adding many more access regions
Categories
(Core Graveyard :: Nanojit, defect)
Tracking
(Not tracked)
RESOLVED
FIXED
People
(Reporter: n.nethercote, Assigned: n.nethercote)
References
Details
(Whiteboard: fixed-in-nanojit, fixed-in-tracemonkey, fixed-in-tamarin)
Attachments
(4 files)
9.06 KB, patch (edwsmith: review+)
4.51 KB, patch (gal: review+)
3.23 KB, patch (edwsmith: review+)
888 bytes, patch (gal: review+)
This patch adds some optimisations to avoid slow-downs caused by using more
access regions.
- In CseFilter, only clear a hash table if it is non-empty. This avoids a
good number of memset() calls.
- Add a function msbSet() which uses machine instructions to find the
most-significant set bit in a uint32_t. I haven't yet tested msbSet() on
Windows, but it's cribbed from jsbit.h in TM, so hopefully it'll work. I'll
run it through the try server shortly.
- Use msbSet() in CseFilter::insLoad() to optimise the invalidation loop, so
there's only one iteration for each bit set in storesSinceLastLoad. Also
use it to speed up compressAccSet(), along with the new isSingletonSet()
function.
Also:
- Change the way embedder-specific values are passed into ValidateWriter --
use an array, so any number of values can be passed in.
- Add 'disp' to checkAccSet().
Attachment #462670 -
Flags: review?(edwsmith)
Assignee
Comment 1•14 years ago
Update TM for the NJ patches.
Attachment #462671 -
Flags: review?(gal)
Updated•14 years ago
Attachment #462671 -
Flags: review?(gal) → review+
Comment 2•14 years ago
cool, msbSet()'s active ingredient is the same instruction as nRegisterAllocFromSet (more or less; bsf vs bsr). I also noticed this patch's use of msbSet() would work equally well on architectures that only have an "lsb" equivalent instruction. See NativeARM.cpp for more crib-able code for ARM.
It occurs to me we could also use this to speed up iterating through various register sets. A common pattern is:
for (Register r = FirstReg; r <= LastReg; r = nextreg(r))
if (/* mask expression of rmask(r) */ && something)
..
it could be replaced by:
RegisterMask mask = /* equivalent mask expression */
for (Register r = msbSet(mask); mask != 0; r = msbSet(mask))
if (/* something */)
...
mask &= ~rmask(r)
I remember a while back you were looking at speeding up those iteration loops but I don't know if you tried something based on the BSR/BSF (or equivalent) machine instruction.
Comment 3•14 years ago
Comment on attachment 462670 [details] [diff] [review]
NJ patch (against 48641:8cae6d5cd69c)
Looks correct. It would be good to factor out the bit-scanning code from each backend, but it's not something that must be done on the critical path.
Attachment #462670 -
Flags: review?(edwsmith) → review+
Assignee
Comment 4•14 years ago
(In reply to comment #2)
>
> It occurs to me we could also use this to speed up iterating through various
> register sets. A common pattern is:
>
> for (Register r = FirstReg; r <= LastReg; r = nextreg(r))
> if (/* mask expression of rmask(r) */ && something)
> ..
>
> it could be replaced by:
> RegisterMask mask = /* equivalent mask expression */
> for (Register r = msbSet(mask); mask != 0; r = msbSet())
> if (/* something */)
> ...
> mask &= ~rmask(r)
>
> I remember a while back you were looking at speeding up those iteration loops
> but i dont know if you tried something based on the BSR/BSF (or equivalent)
> machine instruction.
I didn't, but it's a good idea for the ones where the pattern applies. There are still several of the form:
for (Register r = FirstReg; r <= LastReg; r = nextreg(r)) {
    LIns* ins = _allocator.getActive(r);
    if (ins) {
        ...
    }
}
Is there a fast way to do the same thing at the level of words? I guess we could have a bit-set alongside 'active' to enable bitscanning, though that's less ideal.
Assignee
Comment 5•14 years ago
Attachment #462941 -
Flags: review?(edwsmith)
Comment 6•14 years ago
(In reply to comment #4)
> Is there a fast way to do the same thing at the level of words? I guess we
> could have a bit-set alongside 'active' to enable bitscanning, though that's
> less ideal.
two ideas:
1. active = ~free & managed, then use bsr/bsf. One alu op to compute
the active bitmask (free is managed in RegAlloc, and "managed" is
constant). the loop in unionRegisterState would look like this:
mask = (~current.free | ~saved.free) & managed
for (... bsf-iteration of mask)
2. replace active[] with a Briggs/Torczon style sparse set
http://citeseerx.ist.psu.edu/viewdoc/summary?doi=10.1.1.30.7319
The sparse set maintains two arrays of [MaxReg]: sparse[] and dense[].
sparse[] is indexed by register, like _active is, but contains an index
in dense. add, get, and delete are O(1), and iteration is O(N) where N
is the # of active entries. (vs a bitset which is O(U)).
Using BSF/BSR on a 1-word bitmask ought to be just as fast though, since
it could skip 1..32 bits in 1 machine cycle, making iteration O(N)
instead of O(U). the sparse set starts to beat the bitmap when U/32 >> N,
ie when the set is *so* big and sparse that we start losing the benefit
of the 1-cycle BSF instruction and 32-bit parallel ALU ops.
Oh, and idea #3: use SSE instructions to slurp 2 or 4 words at a time,
or more. This doesn't seem worth the bother.
Updated•14 years ago
Attachment #462941 -
Flags: review?(edwsmith) → review+
Comment 7•14 years ago
I prototyped idea #1 in Bug 584935; it's showing about a 5% speedup in pure jit speed (large input set of 1000s of methods, no jit-code execution) on x64. Expect that % to be diluted in the real world, of course.
Comment 8•14 years ago
Is this fixed?
http://hg.mozilla.org/mozilla-central/rev/a99f99f9f614
Assignee
Comment 9•14 years ago
It's fixed in NJ and TM, sorry I forgot the whiteboard+links:
http://hg.mozilla.org/projects/nanojit-central/rev/64ae262a0203
http://hg.mozilla.org/tracemonkey/rev/a99f99f9f614
Whiteboard: fixed-in-nanojit, fixed-in-tracemonkey
Comment 10•14 years ago
This snuck in because TR's implementation of Allocator::allocChunk() calls malloc(), while IIRC TM's calls calloc().
Attachment #464461 -
Flags: review?(gal)
Updated•14 years ago
Attachment #464461 -
Flags: review?(gal) → review+
Comment 11•14 years ago
nj patches
http://hg.mozilla.org/tamarin-redux/rev/5f0eb4431a4c
http://hg.mozilla.org/tamarin-redux/rev/053e59c92312
http://hg.mozilla.org/tamarin-redux/rev/355dd7b3ec01
tr patch
http://hg.mozilla.org/tamarin-redux/rev/4530cb245a32
Whiteboard: fixed-in-nanojit, fixed-in-tracemonkey → fixed-in-nanojit, fixed-in-tracemonkey, fixed-in-tamarin
Assignee
Updated•14 years ago
Status: ASSIGNED → RESOLVED
Closed: 14 years ago
Resolution: --- → FIXED
Updated•11 years ago
Product: Core → Core Graveyard