Closed
Bug 474443
Opened 17 years ago
Closed 16 years ago
TM: Implement oracle-based speculative fmod/fdiv/fmul demotion
Categories
(Core :: JavaScript Engine, enhancement)
Tracking
()
RESOLVED
FIXED
People
(Reporter: dancol, Assigned: gal)
References
Details
(Whiteboard: fixed-in-tracemonkey)
Attachments
(2 files, 14 obsolete files)
|
36.38 KB,
patch
|
Details | Diff | Splinter Review | |
|
37.51 KB,
patch
|
dvander
:
review+
|
Details | Diff | Splinter Review |
User-Agent: Mozilla/5.0 (X11; U; Linux i686; en-US; rv:1.9.0.5) Gecko/2008121622 Fedora/3.0.5-1.fc9 Firefox/3.0.5
Build Identifier:
Right now, tracemonkey generates a call to js_imod. For example,
for(var i = 0; i < 3; ++i) { 5 % i }
Generates the following assembly:
mov -4(ebp),ecx ecx(state)
mov eax,ecx ecx(state)
mov edi,0(eax) eax(state)
mov ebx,12(eax) eax(state) edi(sp)
mov -8(ebp),ebx eax(state) ebx(cx) edi(sp)
mov esi,8(eax) eax(state) edi(sp)
mov ebx,648(esi) esi(gp) edi(sp)
mov 0(edi),5 ebx(ld1) esi(gp) edi(sp)
mov 8(edi),ebx ebx(ld1) esi(gp) edi(sp)
mov edx,ebx ebx(ld1) esi(gp) edi(sp)
mov ecx,5 ebx(ld1) esi(gp) edi(sp)
call js_imod ebx(ld1) esi(gp) edi(sp)
mov edx,eax eax(js_imod1) ebx(ld1) esi(gp) edi(sp)
cmp edx,-1 edx(js_imod1) ebx(ld1) esi(gp) edi(sp)
je 0x113f85 edx(js_imod1) ebx(ld1) esi(gp) edi(sp)
However, gcc generates the following code for
int foo(int a, int b) {
return a % b;
}
8048480: 55 push %ebp
8048481: 89 e5 mov %esp,%ebp
8048483: 8b 55 08 mov 0x8(%ebp),%edx
8048486: 89 d0 mov %edx,%eax
8048488: c1 fa 1f sar $0x1f,%edx
804848b: f7 7d 0c idivl 0xc(%ebp)
804848e: 5d pop %ebp
804848f: 89 d0 mov %edx,%eax
8048491: c3 ret
This gives the same result (for positive and negative values of both operands) as the Javascript modulus. Tracemonkey should use idiv.
Reproducible: Always
Updated•17 years ago
|
Summary: Modulus could be better optimized → TM: Modulus could be better optimized
Updated•17 years ago
|
Version: unspecified → Trunk
Comment 1•17 years ago
|
||
This may depend on an existing bug to beef up the oracle.
/be
Status: UNCONFIRMED → NEW
Ever confirmed: true
Comment 2•17 years ago
|
||
Hmm? As far as I can tell, this is about our translation of div and mod. Like ... bug 455496 ? I don't think it relates to the oracle.
| Assignee | ||
Comment 3•17 years ago
|
||
It does relate to the oracle.
What we want to do when we compile % and / is to check whether both operands and the result are integer, and in that case emit an integer modulo/division with an overflow check, _except_ if the oracle says this speculation failed before. We wire the overflow exit to trigger to notify the oracle about failed speculations.
Note that even modulo is rather tricky with JS, i.e. (5 % -3) and such.
Comment 4•17 years ago
|
||
I don't think (a % b) can overflow if a and b are integers. The only case we need to check for that gcc doesn't is where b == 0, right?
| Assignee | ||
Comment 5•17 years ago
|
||
In this patch I seem to be going to extreme length to avoid negative b, but for the life of me I can't remember why I did that.
https://bugzilla.mozilla.org/attachment.cgi?id=340296
| Assignee | ||
Comment 6•17 years ago
|
||
This is probably a measurable perf impact for array-index heavy code (which tends to use / and % with integer indexes).
Flags: wanted1.9.1?
| Assignee | ||
Comment 7•17 years ago
|
||
This patch implements LIR_div and LIR_mod. LIR_div is a signed integer division. LIR_mod(LIR_div(a,b)) returns the modulo of a and b. On x86, this merely retrieves it from EDX. Other (RISC) architectures can peek a and b and actually calculate it. The patch currently only demotes for LIR_mod. We speculate that the rhs is never 0. We also speculate that the result is not -0. Both conditions are checked in the generated code and force an OVERFLOW_EXIT, which we should handle using the oracle. Its probably rare enough though to make sense without the oracle, too.
mov eax,5 ecx(state) esi(gp) edi(sp)
mov 0(edi),5 eax(5) ecx(state) esi(gp) edi(sp)
mov ebx,656(esi) eax(5) ecx(state) esi(gp) edi(sp)
mov 8(edi),ebx eax(5) ecx(state) ebx(ld2) esi(gp) edi(sp)
test ebx,ebx eax(5) ecx(state) ebx(ld2) esi(gp) edi(sp)
je 0x2b8fd0 eax(5) ecx(state) ebx(ld2) esi(gp) edi(sp)
cdq eax(5) ecx(state) ebx(ld2) esi(gp) edi(sp)
idiv edx:eax, ebx eax(5) ecx(state) ebx(ld2) esi(gp) edi(sp)
mov -4(ebp),eax eax(div1) ecx(state) ebx(ld2) esi(gp) edi(sp)
mov eax,edx ecx(state) edx(mod1) ebx(ld2) esi(gp) edi(sp)
mov edx,-4(ebp) ecx(state) ebx(ld2) esi(gp) edi(sp)
test eax,eax eax(mod1) ecx(state) ebx(ld2) esi(gp) edi(sp)
jne 0x256fcc eax(mod1) ecx(state) ebx(ld2) esi(gp) edi(sp)
cmp eax,0 eax(mod1) ecx(state) ebx(ld2) esi(gp) edi(sp)
jl 0x2b8fdc ecx(state) ebx(ld2) esi(gp) edi(sp)
Not as pretty as gcc and a bunch of register memory moves, but still much better than calling the builtin and a lot better than float math.
Passes trace-tests, but I don't recommend for 3.1. Good basis for the oracle-based speculation though. This patch needs _extensive_ review since it mucks with LIRopcode.tbl (ran out of opcode space). At least danderson and Waldo should take a look, plus ed.
| Assignee | ||
Updated•17 years ago
|
Summary: TM: Modulus could be better optimized → TM: Implement oracle-based speculative fmod/fdiv/fmul demotion
| Assignee | ||
Comment 8•17 years ago
|
||
Assignee: general → gal
Attachment #359498 -
Attachment is obsolete: true
| Assignee | ||
Comment 10•17 years ago
|
||
The attached patch demotes for / % * + - and unary - (neg).
Attachment #364787 -
Attachment is obsolete: true
| Assignee | ||
Comment 12•17 years ago
|
||
Attachment #364795 -
Attachment is obsolete: true
| Assignee | ||
Comment 13•17 years ago
|
||
Attachment #364825 -
Attachment is obsolete: true
| Assignee | ||
Comment 14•17 years ago
|
||
Attachment #365791 -
Attachment is obsolete: true
| Assignee | ||
Comment 15•17 years ago
|
||
Comment on attachment 367547 [details] [diff] [review]
refreshed patch
Asking for review. Still needs additional work for ARM and other architectures.
Attachment #367547 -
Flags: review?(graydon)
Comment 16•17 years ago
|
||
Comment on attachment 367547 [details] [diff] [review]
refreshed patch
>+
>+ /*
>+ * As long the modulus is zero, the result is an integer.
>+ */
>+ guard(true, lir->ins_eq0(lir->ins1(LIR_mod, result)), exit);
>+ break;
... "As long *as* the modulus is zero"
>+ case LIR_fmod:
>+ if (d0->isconst() && d1->isconst())
>+ return lir->ins1(LIR_i2f, lir->insImm(jsint(r)));
>+
>+ /*
>+ * Make sure we don't trigger division by zero at runtime.
>+ */
>+ guard(false, lir->ins_eq0(d1), exit = snapshot(OVERFLOW_EXIT));
>+ result = lir->ins1(v = LIR_mod, lir->ins2(LIR_div, d0, d1));
>+
>+ /*
>+ * If the result is not 0, it is always within the integer domain.
>+ */
>+ branch = lir->insBranch(LIR_jf, lir->ins_eq0(result), NULL);
>+
>+ /*
>+ * If the result is zero, we must exit if the lhs is negative since
>+ * the result is -0 in this case, which is not in the integer domain.
>+ */
>+ guard(false, lir->ins2i(LIR_lt, result, 0), exit);
>+ branch->target(lir->ins0(LIR_label));
>+ break;
You're guarding on 'result' being negative. You mean to be guarding on LHS being
negative (so says the comment, and so says E-262-3). So change that to
guard(false, lir->ins2i(LIR_lt, lhs, 0), exit);
>+ if (op == LIR_div) {
>+#ifdef NANOJIT_IA32
>+ allow = 1<<EAX;
>+ evict(EDX);
>+#elif NANOJIT_AMD64
>+ allow = 1<<RAX;
>+ evict(EDX);
>+#endif
I think the second case here is evict(RDX), no?
Otherwise generally looks good, assuming you can cook up something for ARM and
check on the x64 variant. Tested on some numerics-heavy benchmarks, I imagine?
Updated•17 years ago
|
Attachment #367547 -
Flags: review?(graydon) → review+
Comment 17•17 years ago
|
||
Ok, I've done some poking around for the ARM side as you asked about division performance.
Basically, when you're using VFP, most of the costs come from transferring data to and from VFP/Neon registers. Thus, if you want an integer result and you have an integer input, it will be _much_ faster to use integer divide, even though it is implemented in software. (That, I believe, is the optimization you are trying to make.)
The performance of the division itself appears to be about the same, whether it's a software integer routine or a hardware double-FP call. Of course, this is all hardware-dependent, but that is likely to be the case for most ARM targets (as far as I can tell).
The other important thing to note is that there are some very fast FP operations available to Neon, but they don't meet IEEE754's precision requirements and so wouldn't be properly JavaScript-compatible. There may, however, be circumstances where they could be beneficial.
Finally, the ARM division routines take a variable amount of time depending on the input data. As a result, the performance different will depend on the data set. Do you know what kind of data we are likely to see?
| Assignee | ||
Comment 18•16 years ago
|
||
Carrying forward graydon's review.
Attachment #367547 -
Attachment is obsolete: true
Attachment #376132 -
Flags: review+
| Assignee | ||
Comment 19•16 years ago
|
||
Jacob, could you do the arm part for this? I would like to land it. This could help ARM significantly.
| Assignee | ||
Comment 20•16 years ago
|
||
This is a 10ms speedup on x86 (8% on crypto-aes), so I will go ahead and land this as soon ARM has caught up.
Comment 21•16 years ago
|
||
(In reply to comment #19)
> Jacob, could you do the arm part for this? I would like to land it. This could
> help ARM significantly.
Yep, I'll get on it.
| Assignee | ||
Comment 22•16 years ago
|
||
I don't think there is time or need for this for 3.5, so wanted 1.9.2
Flags: wanted1.9.1? → wanted1.9.2?
| Reporter | ||
Comment 23•16 years ago
|
||
@22: This bug is important for our math-heavy application. If the fix is not in 3.5, we'll have to wait another release cycle to see included, and those can be very long indeed. The patch itself has been reviewed and tested, at least in x86; other platforms can continue using the generic implementation, which works fine. Given that the patch is here, works, and improves performance noticeably, can you please reconsider including it in 3.5?
| Assignee | ||
Comment 24•16 years ago
|
||
I will set it back to wanted 1.9.1 and discuss it with the drivers. Since it didn't make the last beta, I am pretty worried about the extra risk. Do you have math heavy benchmarks that speed up significantly from this? In my internal testing I see about 1-2% on our benchmark set.
Flags: wanted1.9.2? → wanted1.9.1?
| Reporter | ||
Comment 25•16 years ago
|
||
Never mind that --- After rerunning our geometry testcases against the latest code, the performance difference has disappeared (or at least slipped below easy detection.)
| Assignee | ||
Comment 26•16 years ago
|
||
Great. Thats good to hear. I expect this patch to mostly benefit ARM and potentially sparc. For mul/div the x86 FPU is really fast, and so is the float to int conversion path if mul/div/mod is used in array indexes.
Comment 27•16 years ago
|
||
Important notes:
* With or without my additions, the existing patch causes weird failures in trace-tests.js that don't seem to be directly related to divisions (though I haven't looked into this). Does the x86 version do the same?
* I'm not entirely convinced that I've done things correctly or optimally.
- The imod function could be emitted as simple instructions (by NativeARM.cpp), but it's not clear how I should get the result of a previous instruction (which is what I need to do). For example, I can easily do ins->oprnd1(), but there's no ins->result(). I'm probably missing something obvious.
- I'm guessing that I can pass insCall a pointer to an instruction and it will use the result of the instruction. This seems to work.
* I return -1 if the denominator is 0. Is this a good thing to do?
* My changes should also work on x86 (for testing) as they aren't really ARM-specific.
* My patch does not pass trace-tests.js, but I don't really know why.
| Assignee | ||
Comment 28•16 years ago
|
||
Attachment #376132 -
Attachment is obsolete: true
| Assignee | ||
Comment 29•16 years ago
|
||
Disable imod/idiv generation on non-i386 platforms since it doesn't seem to help due to lack of hardware division/modulo support (ARM in particular).
Attachment #380004 -
Attachment is obsolete: true
| Assignee | ||
Comment 30•16 years ago
|
||
Patch is cycling through the trial server. 20-25ms speedup. I will land this and then try out more aggressive optimizations, i.e. specialize charAt to integer and specialize array reads and writes.
| Assignee | ||
Comment 31•16 years ago
|
||
Jesse, could I get some fuzzing for this bug? Special attention on arithmetic expressions involving * / %. Thanks.
Comment 32•16 years ago
|
||
Looks ok after 20 minutes with jsfunfuzz and comparison fuzzers.
| Assignee | ||
Comment 33•16 years ago
|
||
Thanks. I get a failure on the unit test boxes. I will have to dig into it.
Comment 34•16 years ago
|
||
After a few more minutes I got:
for each (let h in [[], [], [], [], 31])
for (let i = 0; i < 3; ++i)
print(i / 10);
Crash [@ nanojit::Assembler::nPatchBranch]
Comment 35•16 years ago
|
||
for (let i = 0; i < 5; ++i) { var a = i; print('' + (a %= a)); }
Crash in JIT code? (Called by js_MonitorLoopEdge)
| Assignee | ||
Comment 36•16 years ago
|
||
Awesome Jesse. You rock. I can reproduce the crash here. That will help a lot.
| Assignee | ||
Comment 37•16 years ago
|
||
--------------------------------------- end exit block 0x278324
div1 = div ld1, ld1
mod1 = mod div1
jf eq3 -> label1
cdq eax(state) ebx(cx) esi(sp) edi(ld1)
idiv edx:eax, edi eax(state) ebx(cx) esi(sp) edi(ld1)
mov -12(ebp),edi eax(state) ebx(cx) esi(sp) edi(div1)
mov edi,edx eax(state) edx(mod1) ebx(cx) esi(sp)
mov edx,-12(ebp) eax(state) ebx(cx) esi(sp)
test edi,edi eax(state) ebx(cx) esi(sp) edi(mod1)
jne 0x279ef8 eax(state) ebx(cx) esi(sp) edi(mod1)
We think state is in eax but idiv uses and clobbers it. Something is wrong there.
| Assignee | ||
Comment 38•16 years ago
|
||
This fixes the crash in #35 but produces an incorrect result.
| Assignee | ||
Updated•16 years ago
|
Attachment #380365 -
Attachment is patch: true
Attachment #380365 -
Attachment mime type: application/octet-stream → text/plain
| Assignee | ||
Comment 39•16 years ago
|
||
This fixes the crashes but #35 still produces a bad result.
Attachment #376405 -
Attachment is obsolete: true
Attachment #380007 -
Attachment is obsolete: true
Attachment #380365 -
Attachment is obsolete: true
| Assignee | ||
Comment 40•16 years ago
|
||
This patch shuffles around the register selection code to make it less ... aehm ... crazy? Could use some more fuzzing I am sure.
Attachment #380382 -
Attachment is obsolete: true
| Assignee | ||
Updated•16 years ago
|
Attachment #380387 -
Flags: review?(nnethercote)
| Assignee | ||
Comment 41•16 years ago
|
||
Cycling through the try server again. Nick, could you review the patch. We had a couple review cycles but a lot of stuff changed and the backend changes are really complex.
| Assignee | ||
Comment 42•16 years ago
|
||
Attachment #380387 -
Attachment is obsolete: true
Attachment #380387 -
Flags: review?(nnethercote)
Comment 43•16 years ago
|
||
var t = new Boolean(true);
for each (var b in [false, 1.5, 1.5, false]) {
--b;
for each (var c in [t, 0.1, {}, {}, {}, t, t, 0.1]) {
c %= 1;
}
}
Assertion failure: !ti->typeMap.matches(ti_other->typeMap), at ../jstracer.cpp:3800
Comment 44•16 years ago
|
||
var f = 1;
for (var a = 0; a < 5; ++a) {
f = (f - 1) >>> 0;
print(a % f);
}
With this patch, I get "NaN 1 2 0 0" instead of "NaN 1 2 3 4".
| Assignee | ||
Comment 45•16 years ago
|
||
Attachment #380473 -
Attachment is obsolete: true
Attachment #382884 -
Flags: review?
| Assignee | ||
Updated•16 years ago
|
Attachment #382884 -
Flags: review? → review?(dvander)
| Assignee | ||
Comment 46•16 years ago
|
||
Comment on attachment 382891 [details] [diff] [review]
extract CASE_EXIT fix (was patched separately) and avoid CDQ and use MOV/SAR instead (supposed to be faster on core2)
>+ CDQ();
Looks like MOV/SAR didn't make it in after all. Patch seems fine anyway.
Attachment #382891 -
Flags: review+
| Assignee | ||
Comment 48•16 years ago
|
||
Pushed to TM.
http://hg.mozilla.org/tracemonkey/rev/812a94dc7dd5
Whiteboard: fixed-in-tracemonkey
Comment 49•16 years ago
|
||
Status: NEW → RESOLVED
Closed: 16 years ago
Resolution: --- → FIXED
Updated•16 years ago
|
Flags: wanted1.9.1? → wanted1.9.2+
Updated•15 years ago
|
Attachment #382884 -
Flags: review?(dvander)
You need to log in
before you can comment on or make changes to this bug.
Description
•