Bug 533546: LIR_quad emits suboptimal code for x86
Opened 15 years ago • Closed 15 years ago
Categories
(Core Graveyard :: Nanojit, defect)
Tracking
(Not tracked)
Status: RESOLVED FIXED
Target Milestone: Future
People
(Reporter: stejohns, Unassigned)
References
Details
(Whiteboard: fixed-in-nanojit, fixed-in-tamarin, fixed-in-tracemonkey)
Attachments
(1 file, 7 obsolete files)
patch, 19.01 KB (n.nethercote: review+) | Details | Diff | Splinter Review
Loading an arbitrary FP constant into an SSE2 register is generally done by storing two 32-bit constants into the local frame and then loading the pair into an SSE2 register. Unfortunately there doesn't seem to be any logic for hoisting these stores out of loops, so a constant used in an inner loop repeats the store-and-reload on every iteration. There ought to be a way to hoist this out of loops (assuming an adequate register supply, which is often the case).
Comment 1 • 15 years ago (Reporter)
FYI: I hacked the code so that the testcase where I first noticed this (SciMark SOR) would always special-case the magic constants we need, avoiding the redundant store every time through the inner loop
Comment 2 • 15 years ago (Reporter)
(Trying again.)
FYI: I hacked the code so that the testcase where I first noticed this (SciMark SOR) would always special-case the magic constants we need, avoiding the redundant store every time through the inner loop, and the overall speed rating doubled. So this is *definitely* worth solving!
Comment 3 • 15 years ago
The easiest fix I can think of would be having the backend create a literal pool for each method (similar to what we do for ARM), so that all constants that can't be used with immediate-mode instructions live in the pool and the code only ever loads them (never stores them).
This approach would avoid the lifetime-range extension logic that "true" hoisting would need.
Presumably this would apply to all CPUs for double literals, since none support immediate-mode 64-bit doubles, so maybe literal-pool management has some shareable parts (or at least ARM stops being a special case).
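As a rough illustration of the literal-pool idea, here is a minimal sketch: each distinct 64-bit constant gets one pooled, 8-byte-aligned slot that code only ever loads from, and repeated requests return the same slot. The names `QuadPool` and `findQuadConstant` are stand-ins modeled on the patch discussed below, not nanojit's actual API.

```cpp
#include <cstdint>
#include <map>

// Hypothetical per-method literal pool for 64-bit double constants.
// Each distinct bit pattern gets exactly one slot; codegen would emit
// a load from the slot's address instead of storing into the frame.
class QuadPool {
    std::map<uint64_t, uint64_t*> slots_;
public:
    ~QuadPool() {
        for (auto& kv : slots_) delete kv.second;
    }
    // Return a stable pointer to a pooled copy of the constant q,
    // creating the slot on first use.
    const uint64_t* findQuadConstant(uint64_t q) {
        auto it = slots_.find(q);
        if (it == slots_.end()) {
            // operator new for a uint64_t is at least 8-aligned on
            // mainstream platforms (an assumption worth asserting).
            uint64_t* p = new uint64_t(q);
            it = slots_.emplace(q, p).first;
        }
        return it->second;
    }
};
```

Because the pool deduplicates by bit pattern, a constant reused across an inner loop costs one slot and one load site per use, with no stores.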
Updated • 15 years ago
Component: JIT Compiler (NanoJIT) → Nanojit
Product: Tamarin → Core
QA Contact: nanojit → nanojit
Target Milestone: --- → Future
Comment 4 • 15 years ago (Reporter)
(In reply to comment #3)
> presumably this would apply to all cpu's for double literals since none support
> immediate-mode 64bit doubles, and so maybe literal-pool-management has some
> shareable parts. (or at least ARM stops being a special case).
Maybe, maybe not... IIRC x64 supports 64-bit integer immediates for GPRs; it may or may not be a perf win to do a load-double-from-memory vs. a load-immediate followed by a reg-reg move.
Comment 5 • 15 years ago (Reporter)
Here's a patch with one approach: it allocates quad constants from dataAlloc and emits loads for those, so no stack space is needed for LIR_quad/LIR_float.
Drawbacks:
-- the patch needed to intrude on core code a bit... partly because we need a new invariant that findMemFor() should never be called for LIR_quad/LIR_float in this mode, and partly because I suspect (without proof) that other architectures may benefit from similar pooling as well.
-- it avoids the redundant stores, but still does nothing to hoist the original load out of a loop when that would be possible.
Still, the above patch makes a substantial perf improvement on code doing lots of fp-const usage in inner loops; a variant of SciMark/SOR nearly doubled in speed with this patch in place.
Attachment #417562 - Flags: review?(edwsmith)
Updated • 15 years ago (Reporter)
Attachment #417562 - Flags: review?(nnethercote)
Comment 6 • 15 years ago (Reporter)
Also: this patch assumes that dataAlloc always returns 8-byte-aligned pointers, which I haven't taken the time to verify. (8-byte alignment of doubles isn't required on x86, but there is a nontrivial performance penalty without it.)
Comment 7 • 15 years ago
Comment on attachment 417562 [details] [diff] [review]
Patch
>diff -r 9d15d6f4868e nanojit/Nativei386.cpp
>--- a/nanojit/Nativei386.cpp Mon Dec 14 13:50:10 2009 -0800
>+++ b/nanojit/Nativei386.cpp Mon Dec 14 14:12:24 2009 -0800
>@@ -466,7 +466,9 @@
> int d = (arg - abi_regcount) * sizeof(intptr_t) + 8;
> LD(r, d, FP);
> }
>- else {
>+ else if (i->isconstq()) {
>+ asm_quad(r, i->imm64());
>+ } else {
> int d = findMemFor(i);
> asm_load(d,r);
> }
Might as well put this case just after the isconst() case.
>+ void Assembler::asm_quad(Register r, uint64_t q)
>+ {
>+ if (rmask(r) & XmmRegs) {
>+ if (q == 0) {
>+ // test (int64)0 since -0.0 == 0.0
>+ SSE_XORPDr(r, r);
>+ } else if (q == 0x3FF0000000000000LL) {
>+ // 1.0 is extremely frequent and worth special-casing!
>+ static const double k_ONE = 1.0;
>+ LDSDm(r, &k_ONE);
>+ } else {
>+ const uint64_t* p = findQuadConstant(q);
>+ LDSDm(r, (const double*)p);
>+ }
>+ } else {
>+ if (q == 0) {
>+ // test (int64)0 since -0.0 == 0.0
>+ FLDZ();
>+ } else if (q ==0x3FF0000000000000LL ) {
>+ FLD1();
>+ } else {
>+ const uint64_t* p = findQuadConstant(q);
>+ FLDQdm((const double*)p);
>+ }
>+ }
>+ }
I hate that you've written out 1.0 as 0x3FF0000000000000LL. It shouldn't be necessary now anyway, right? Just put it into the constant pool with all the others.
Can you add a NanoAssert(r == FST0) in the else branch?
You've removed the "can fit in 32bits" case -- was that deliberate? That's an important case in the old code. Would a mov/xorpd/cvtsi2sd be faster than a load from memory? To do it safely requires propagating 'canClobberCCs' into asm_quad().
>+#define FLDQdm(m) do { const double* const dm = m; count_ldq(); FPUdm(0xdd00, dm); asm_output("fldq (#%p)",dm); fpu_push();} while(0)
I don't think the '#' is necessary. I'm in the process of removing them elsewhere (bug 532566).
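The "can fit in 32 bits" case raised above amounts to checking whether the double's value round-trips through int32, in which case a mov-immediate plus cvtsi2sd could replace a memory load. A hedged sketch of such a predicate (illustrative only; `quadFitsIn32` is not the actual nanojit check):

```cpp
#include <cstdint>
#include <cstring>

// Returns true if the double whose bit pattern is q can be recreated
// exactly from a 32-bit signed integer immediate via cvtsi2sd.
// Rejects NaN, out-of-range values, fractional values, and -0.0
// (which cvtsi2sd would turn into +0.0).
static bool quadFitsIn32(uint64_t q) {
    double d;
    std::memcpy(&d, &q, sizeof d);              // reinterpret bits as double
    if (!(d >= INT32_MIN && d <= INT32_MAX))    // also rejects NaN
        return false;
    int32_t i = (int32_t)d;                     // safe: in range
    if ((double)i != d)                         // fractional part present
        return false;
    return !(i == 0 && (q >> 63));              // reject -0.0
}
```

As the review notes, even when the value qualifies, using cvt is only legal where the condition codes can be clobbered, so the backend would still need a `canClobberCCs`-style guard around this check.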
Comment 8 • 15 years ago
Here's a version that implements most of my suggestions, most notably the "can fit in 32-bits" case. I didn't see any notable difference between your version and my version but it might be worth trying my version on your tests.
BTW, I got a 3x speedup on a microbenchmark that has a quad constant in the main loop with both patches.
Is there any way to hoist things out of loops in Nanojit? Doesn't seem like it, and that's a big shortcoming.
Comment 9 • 15 years ago
Where there's a will, there's a way, right?
Here's a sketch that may seem weird at first but doesn't require code motion or any new passes (though it does add one more reader to the pipeline).
1. Make a LirFilter subclass for const hoisting, and inside it keep a list of LIns* for the constants we have seen so far.
2. read() will call in->read(). When we see a const, add it to our list and skip it, which pulls the const out of the instruction sequence where we see it and saves it for later.
3. When we see LIR_start, switch modes and return one saved constant. Keep doing this until we're out, then return the saved LIR_start. (Maybe this mode switch needs to happen when we see the first LIR_param.)
The effect is that all immediates get lifted out of the code stream and stuck at the top.
In principle a reader like this could be aware of block boundaries and could read ahead, buffer instructions, reorder them, etc. It could even generate new instructions if it keeps a LirBuffer squirreled away internally for its own use (write the new instruction to the buffer, then return a pointer to it; or heck, if traversing these manufactured instructions isn't required, we could use Allocator to new them up as needed, from an appropriately long-lived allocator of course).
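The three steps above can be simulated in miniature. In this toy model, `Ins`, `ConstHoister`, and the opcode strings are stand-ins for nanojit's LIns/LirFilter machinery (assumptions, not the real types), and the stream is read back-to-front the way nanojit's reader pipeline works:

```cpp
#include <string>
#include <vector>

// Toy instruction: just an opcode name.
struct Ins { std::string op; };

// Simulated const-hoisting filter. Instructions are stored in forward
// order but read back-to-front; "const" ops are stashed and skipped,
// then replayed just before "start", so in forward order all constants
// appear at the top of the method.
class ConstHoister {
    std::vector<Ins> in_;
    size_t pos_;               // reads walk from the end toward index 0
    std::vector<Ins> saved_;   // stashed constants awaiting replay
public:
    explicit ConstHoister(std::vector<Ins> s)
        : in_(std::move(s)), pos_(in_.size()) {}
    // Returns false when the stream is exhausted.
    bool read(Ins& out) {
        while (pos_ > 0) {
            Ins i = in_[--pos_];
            if (i.op == "const") {          // stash and skip
                saved_.push_back(i);
                continue;
            }
            if (i.op == "start" && !saved_.empty()) {
                out = saved_.back();        // replay one saved const
                saved_.pop_back();
                ++pos_;                     // revisit "start" next time
                return true;
            }
            out = i;
            return true;
        }
        return false;
    }
};
```

Reading the stream {start, const, add, const, mul, ret} through this filter yields ret, mul, add, const, const, start, i.e. both constants hoisted to the top in forward order.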
Comment 10 • 15 years ago (Reporter)
(In reply to comment #7)
> I hate that you've written out 1.0 as 0x3FF0000000000000LL.
It's ugly, but it avoids a possible reload or reg-reg move for the compare... but...
> It shouldn't be
> necessary now anyway, right? Just put it into the constant pool with all the
> others.
Bingo, so the above is moot.
> Can you add a NanoAssert(r == FST0) in the else branch?
Sure, but that's the only x87 reg that the 386 backend knows about -- what else could it be? :-)
> You've removed the "can fit in 32bits" case -- was that deliberate? That's an
> important case in the old code. Would a mov/xorpd/cvtsi2sd be faster than a
> load from memory?
Good question. My (foolish, untested) assumption was "no". One way to find out.
Comment 11 • 15 years ago (Reporter)
(In reply to comment #8)
> Here's a version that implements most of my suggestions, most notably the "can
> fit in 32-bits" case. I didn't see any notable difference between your version
> and my version but it might be worth trying my version on your tests.
Not a big difference, but on my SciMark test yours seems possibly faster... a composite score of ~199 vs. ~197 for my patch, fairly consistent across multiple runs. So I vote for yours.
Should we move ahead with pushing it as-is, or do we want to do more work first? (IMHO it seems pretty safe and the speed improvement is significant.)
BTW: is it the case that dataAlloc always returns 8-byte-aligned pointers? If not, there's more optimization awaiting us.
Comment 12 • 15 years ago (Reporter)
Also: I wonder if x86-64 might benefit from this. (Probably much less, but who knows)
Comment 13 • 15 years ago
(In reply to comment #10)
> > Can you add a NanoAssert(r == FST0) in the else branch?
>
> sure, but that's the only x87 reg that the 386 backend knows about -- what else
> could it be? :-)
That's what assertions are for -- testing assumptions.
(In reply to comment #11)
>
> should we move ahead with pushing it as-is or do we want to do more work first?
> (IMHO it seems pretty safe and the speed improvement is significant)
I want the debug output to be fixed up at least. And I'm not all that happy with my approach in the "can fit in 32 bits" case -- using cvt for cases where the condition codes can be clobbered and constant pools for other cases is a bit odd.
> BTW: is it the case that dataAlloc always returns 8-byte-aligned pointers? If
> not, there's more optimization awaiting us.
I don't know. That would be good to test. It's possible that TM and TR have different dataAllocs, too.
Comment 14 • 15 years ago
nanojit::Allocator returns 8-aligned blocks as long as its underlying provider does. We should add an assert to check that in Allocator::allocSlow().
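The suggested assert boils down to a low-bits check on the returned pointer. A minimal sketch (`isEightAligned` is a hypothetical helper; the real assert would live in nanojit's Allocator::allocSlow()):

```cpp
#include <cstdint>

// An address is 8-byte aligned iff its low three bits are zero.
// The real check would assert this on every block the underlying
// provider hands back, before quad-constant slots are carved from it.
static bool isEightAligned(const void* p) {
    return (reinterpret_cast<uintptr_t>(p) & 7) == 0;
}
```

If the provider ever returns a misaligned block, x86 still executes the unaligned SSE2 loads correctly, but with the performance penalty comment 6 mentions, which is exactly why an assert (rather than silent tolerance) is the right tool here.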
Comment 15 • 15 years ago (Reporter)
(In reply to comment #13)
> (In reply to comment #11)
> > should we move ahead with pushing it as-is or do we want to do more work
>
> I want the debug output to be fixed up at least. And I'm not all that happy
> with my approach in the "can fit in 32 bits" case -- using cvt for cases where
> the condition codes can be clobbered and constant pools for other cases is a
> bit odd.
OK -- do you want to nurse it through to completion with a subsequent patch, or shall I?
Comment 16 • 15 years ago
I can take over if you're happy to wait a bit for it to go in -- I have a stack of things I'm trying to get in and this conflicts with some of them (eg. bug 516347)... we manage to keep treading on each other's toes.
Comment 17 • 15 years ago (Reporter)
(In reply to comment #16)
> I can take over if you're happy to wait a bit for it to go in -- I have a stack
> of things I'm trying to get in and this conflicts with some of them (eg. bug
> 516347)... we manage to keep treading on each other's toes.
Let's see who gets to it first... I have various higher-priority items on my plate at the moment so probably won't get to it for a few days (possibly till after the holidays).
Updated • 15 years ago
Attachment #417562 - Flags: review?(edwsmith)
Comment 18 • 15 years ago (Reporter)
(In reply to comment #14)
> nanojit::Allocator returns 8-aligned blocks as long as its underlying provider
> does. we should add an assert to check that in Allocator::allocSlow().
Adding such an assert fails massively. New bug: https://bugzilla.mozilla.org/show_bug.cgi?id=536568
Comment 19 • 15 years ago (Reporter)
Updated/rebased version of Nick's patch.
> I want the debug output to be fixed up at least.
In what way? Granted the current output is a little iffy (basically emitting a comment about a quad constant being generated at a specific address) but presumably you have something better in mind?
> And I'm not all that happy
> with my approach in the "can fit in 32 bits" case -- using cvt for cases where
> the condition codes can be clobbered and constant pools for other cases is a
> bit odd.
Yeah, it's a bit odd, but I'm not sure there's a clearly better approach for x86.
Attachment #417562 - Attachment is obsolete: true
Attachment #417832 - Attachment is obsolete: true
Attachment #419020 - Flags: review?(nnethercote)
Attachment #417562 - Flags: review?(nnethercote)
Updated • 15 years ago (Reporter)
Attachment #419020 - Flags: review?(edwsmith)
Comment 20 • 15 years ago
Comment on attachment 419020 [details] [diff] [review]
Updated patch
Once njn is happy, I'm happy.
Attachment #419020 - Flags: review?(edwsmith) → review+
Comment 21 • 15 years ago
I didn't like that the quad comment got between the instruction and any "<= restore" suffix, so I just got rid of it; the constant printed after the movsd is good enough IMO. I also got rid of some unnecessary '#'s before addresses.
Attachment #419020 - Attachment is obsolete: true
Attachment #419053 - Flags: review?(stejohns)
Attachment #419020 - Flags: review?(nnethercote)
Comment 22 • 15 years ago
Attaching the right patch this time.
Attachment #419053 - Attachment is obsolete: true
Attachment #419054 - Flags: review?(stejohns)
Attachment #419053 - Flags: review?(stejohns)
Comment 23 • 15 years ago (Reporter)
Comment on attachment 419054 [details] [diff] [review]
my patch v2
Looks good. I think we should experiment with Edwin's idea about hoisting the constants too, but that can come in a subsequent patch.
Attachment #419054 - Flags: review?(stejohns) → review+
Comment 24 • 15 years ago (Reporter)
After-holidays ping... are we ready to land the existing patch?
Comment 25 • 15 years ago
Yep!
Comment 26 • 15 years ago (Reporter)
Applying Nick's patch to the current NJ tip fails the current lirasm --random run when SSE2 is disabled: asm_fop asserts, since it calls findMemFor erroneously. This adds direct-memory versions of the x87 operations to deal with that case.
Attachment #419054 - Attachment is obsolete: true
Attachment #419986 - Flags: review?(nnethercote)
Updated • 15 years ago
Attachment #419986 - Flags: review?(nnethercote) → review+
Comment 27 • 15 years ago (Reporter)
FYI, this passes NJ sanity tests, but I am going to do further torture-testing in Flash and TM... if those pass (with SSE2 disabled) then I will commit.
Comment 28 • 15 years ago (Reporter)
asm_fcmp also wasn't happy in non-sse2 mode (lirasm --random didn't trigger it for me, but some FP-heavy code in Flash did).
Attachment #419986 - Attachment is obsolete: true
Attachment #419997 - Flags: review?(nnethercote)
Comment 29 • 15 years ago
Comment on attachment 419997 [details] [diff] [review]
nick's patch, with non-sse2 goodness (v2)
Have you looked through all the affected asm_* functions to see that the x87 case has been handled? Better to do it via inspection than by waiting for tests to trigger it...
Attachment #419997 - Flags: review?(nnethercote) → review+
Comment 30 • 15 years ago (Reporter)
(In reply to comment #29)
> (From update of attachment 419997 [details] [diff] [review])
> Have you looked through all the affected asm_* functions to see that the x87
> case has been handled?
Yep, but I missed this one. I'm going to make a second pass. Not sure if there's a good way to restructure the code to make it harder to get wrong (at least not without major pain).
Comment 31 • 15 years ago (Reporter)
Further inspection shows that an inappropriate call to findMemFor() is also possible in asm_arg, and in asm_qlo/asm_qhi (but only if ExprFilter isn't in use)... I can't find a way to trigger these cases in TR, TM, Flash, or lirasm, so the fixes are a bit theoretical.
Attachment #419997 - Attachment is obsolete: true
Attachment #420004 - Flags: review?(nnethercote)
Updated • 15 years ago
Attachment #420004 - Flags: review?(nnethercote) → review+
Comment 32 • 15 years ago (Reporter)
Comment 33 • 15 years ago (Reporter)
See https://bugzilla.mozilla.org/show_bug.cgi?id=538000 for an approach that might be useful for hoisting the constants.
Comment 34 • 15 years ago (Reporter)
Updated • 15 years ago (Reporter)
Whiteboard: fixed-in-nanojit
Updated • 15 years ago (Reporter)
Whiteboard: fixed-in-nanojit → fixed-in-nanojit, fixed-in-tamarin
Comment 35 • 15 years ago
Whiteboard: fixed-in-nanojit, fixed-in-tamarin → fixed-in-nanojit, fixed-in-tamarin, fixed-in-tracemonkey
Comment 36 • 15 years ago
Status: NEW → RESOLVED
Closed: 15 years ago
Resolution: --- → FIXED
Updated • 11 years ago (Assignee)
Product: Core → Core Graveyard