Closed Bug 474443 Opened 17 years ago Closed 16 years ago

TM: Implement oracle-based speculative fmod/fdiv/fmul demotion

Categories

(Core :: JavaScript Engine, enhancement)

x86
Linux
enhancement
Not set
normal

Tracking

()

RESOLVED FIXED

People

(Reporter: dancol, Assigned: gal)

References

Details

(Whiteboard: fixed-in-tracemonkey)

Attachments

(2 files, 14 obsolete files)

36.38 KB, patch
Details | Diff | Splinter Review
37.51 KB, patch
dvander
: review+
Details | Diff | Splinter Review
User-Agent: Mozilla/5.0 (X11; U; Linux i686; en-US; rv:1.9.0.5) Gecko/2008121622 Fedora/3.0.5-1.fc9 Firefox/3.0.5 Build Identifier: Right now, tracemonkey generates a call to js_imod. For example, for(var i = 0; i < 3; ++i) { 5 % i } Generates the following assembly: mov -4(ebp),ecx ecx(state) mov eax,ecx ecx(state) mov edi,0(eax) eax(state) mov ebx,12(eax) eax(state) edi(sp) mov -8(ebp),ebx eax(state) ebx(cx) edi(sp) mov esi,8(eax) eax(state) edi(sp) mov ebx,648(esi) esi(gp) edi(sp) mov 0(edi),5 ebx(ld1) esi(gp) edi(sp) mov 8(edi),ebx ebx(ld1) esi(gp) edi(sp) mov edx,ebx ebx(ld1) esi(gp) edi(sp) mov ecx,5 ebx(ld1) esi(gp) edi(sp) call js_imod ebx(ld1) esi(gp) edi(sp) mov edx,eax eax(js_imod1) ebx(ld1) esi(gp) edi(sp) cmp edx,-1 edx(js_imod1) ebx(ld1) esi(gp) edi(sp) je 0x113f85 edx(js_imod1) ebx(ld1) esi(gp) edi(sp) However, gcc generates the following code for int foo(int a, int b) { return a % b; } 8048480: 55 push %ebp 8048481: 89 e5 mov %esp,%ebp 8048483: 8b 55 08 mov 0x8(%ebp),%edx 8048486: 89 d0 mov %edx,%eax 8048488: c1 fa 1f sar $0x1f,%edx 804848b: f7 7d 0c idivl 0xc(%ebp) 804848e: 5d pop %ebp 804848f: 89 d0 mov %edx,%eax 8048491: c3 ret This gives the same result (for positive and negative values of both operands) as the Javascript modulus. Tracemonkey should use idiv. Reproducible: Always
Summary: Modulus could be better optimized → TM: Modulus could be better optimized
Version: unspecified → Trunk
This may depend on an existing bug to beef up the oracle. /be
Status: UNCONFIRMED → NEW
Ever confirmed: true
Hmm? As far as I can tell, this is about our translation of div and mod. Like ... bug 455496 ? I don't think it relates to the oracle.
It does relate to the oracle. What we want to do when we compile % and / is to check whether both operands and the result are integer, and in that case emit an integer modulo/division with an overflow check, _except_ if the oracle says this speculation failed before. We wire the overflow exit to trigger to notify the oracle about failed speculations. Note that even modulo is rather tricky with JS, i.e. (5 % -3) and such.
I don't think (a % b) can overflow if a and b are integers. The only case we need to check for that gcc doesn't is where b == 0, right?
In this patch I seem to be going to extreme length to avoid negative b, but for the life of me I can't remember why I did that. https://bugzilla.mozilla.org/attachment.cgi?id=340296
This is probably a measurable perf impact for array-index heavy code (which tends to use / and % with integer indexes).
Flags: wanted1.9.1?
Attached patch patch (obsolete) — Splinter Review
This patch implements LIR_div and LIR_mod. LIR_div is a signed integer division. LIR_mod(LIR_div(a,b)) returns the modulo of a and b. On x86, this merely retrieves it from EDX. Other (RISC) architectures can peek a and b and actually calculate it. The patch currently only demotes for LIR_mod. We speculate that the rhs is never 0. We also speculate that the result is not -0. Both conditions are checked in the generated code and force an OVERFLOW_EXIT, which we should handle using the oracle. Its probably rare enough though to make sense without the oracle, too. mov eax,5 ecx(state) esi(gp) edi(sp) mov 0(edi),5 eax(5) ecx(state) esi(gp) edi(sp) mov ebx,656(esi) eax(5) ecx(state) esi(gp) edi(sp) mov 8(edi),ebx eax(5) ecx(state) ebx(ld2) esi(gp) edi(sp) test ebx,ebx eax(5) ecx(state) ebx(ld2) esi(gp) edi(sp) je 0x2b8fd0 eax(5) ecx(state) ebx(ld2) esi(gp) edi(sp) cdq eax(5) ecx(state) ebx(ld2) esi(gp) edi(sp) idiv edx:eax, ebx eax(5) ecx(state) ebx(ld2) esi(gp) edi(sp) mov -4(ebp),eax eax(div1) ecx(state) ebx(ld2) esi(gp) edi(sp) mov eax,edx ecx(state) edx(mod1) ebx(ld2) esi(gp) edi(sp) mov edx,-4(ebp) ecx(state) ebx(ld2) esi(gp) edi(sp) test eax,eax eax(mod1) ecx(state) ebx(ld2) esi(gp) edi(sp) jne 0x256fcc eax(mod1) ecx(state) ebx(ld2) esi(gp) edi(sp) cmp eax,0 eax(mod1) ecx(state) ebx(ld2) esi(gp) edi(sp) jl 0x2b8fdc ecx(state) ebx(ld2) esi(gp) edi(sp) Not as pretty as gcc and a bunch of register memory moves, but still much better than calling the builtin and a lot better than float math. Passes trace-tests, but I don't recommend for 3.1. Good basis for the oracle-based speculation though. This patch needs _extensive_ review since it mucks with LIRopcode.tbl (ran out of opcode space). At least danderson and Waldo should take a look, plus ed.
Summary: TM: Modulus could be better optimized → TM: Implement oracle-based speculative fmod/fdiv/fmul demotion
Attached patch refreshed for TM tip (obsolete) — Splinter Review
Assignee: general → gal
Attachment #359498 - Attachment is obsolete: true
The attached patch demotes for / % * + - and unary - (neg).
Attachment #364787 - Attachment is obsolete: true
Attached patch patch, passes trace-tests (obsolete) — Splinter Review
Attachment #364795 - Attachment is obsolete: true
Attached patch freshed patch (obsolete) — Splinter Review
Attachment #364825 - Attachment is obsolete: true
Attached patch refreshed patch (obsolete) — Splinter Review
Attachment #365791 - Attachment is obsolete: true
Comment on attachment 367547 [details] [diff] [review] refreshed patch Asking for review. Still needs additional work for ARM and other architectures.
Attachment #367547 - Flags: review?(graydon)
Comment on attachment 367547 [details] [diff] [review] refreshed patch >+ >+ /* >+ * As long the modulus is zero, the result is an integer. >+ */ >+ guard(true, lir->ins_eq0(lir->ins1(LIR_mod, result)), exit); >+ break; ... "As long *as* the modulus is zero" >+ case LIR_fmod: >+ if (d0->isconst() && d1->isconst()) >+ return lir->ins1(LIR_i2f, lir->insImm(jsint(r))); >+ >+ /* >+ * Make sure we don't trigger division by zero at runtime. >+ */ >+ guard(false, lir->ins_eq0(d1), exit = snapshot(OVERFLOW_EXIT)); >+ result = lir->ins1(v = LIR_mod, lir->ins2(LIR_div, d0, d1)); >+ >+ /* >+ * If the result is not 0, it is always within the integer domain. >+ */ >+ branch = lir->insBranch(LIR_jf, lir->ins_eq0(result), NULL); >+ >+ /* >+ * If the result is zero, we must exit if the lhs is negative since >+ * the result is -0 in this case, which is not in the integer domain. >+ */ >+ guard(false, lir->ins2i(LIR_lt, result, 0), exit); >+ branch->target(lir->ins0(LIR_label)); >+ break; You're guarding on 'result' being negative. You mean to be guarding on LHS being negative (so says the comment, and so says E-262-3). So change that to guard(false, lir->ins2i(LIR_lt, lhs, 0), exit); >+ if (op == LIR_div) { >+#ifdef NANOJIT_IA32 >+ allow = 1<<EAX; >+ evict(EDX); >+#elif NANOJIT_AMD64 >+ allow = 1<<RAX; >+ evict(EDX); >+#endif I think the second case here is evict(RDX), no? Otherwise generally looks good, assuming you can cook up something for ARM and check on the x64 variant. Tested on some numerics-heavy benchmarks, I imagine?
Attachment #367547 - Flags: review?(graydon) → review+
Ok, I've done some poking around for the ARM side as you asked about division performance. Basically, when you're using VFP, most of the costs come from transferring data to and from VFP/Neon registers. Thus, if you want an integer result and you have an integer input, it will be _much_ faster to use integer divide, even though it is implemented in software. (That, I believe, is the optimization you are trying to make.) The performance of the division itself appears to be about the same, whether it's a software integer routine or a hardware double-FP call. Of course, this is all hardware-dependent, but that is likely to be the case for most ARM targets (as far as I can tell). The other important thing to note is that there are some very fast FP operations available to Neon, but they don't meet IEEE754's precision requirements and so wouldn't be properly JavaScript-compatible. There may, however, be circumstances where they could be beneficial. Finally, the ARM division routines take a variable amount of time depending on the input data. As a result, the performance different will depend on the data set. Do you know what kind of data we are likely to see?
Attached patch refreshed against tip (obsolete) — Splinter Review
Carrying forward graydon's review.
Attachment #367547 - Attachment is obsolete: true
Attachment #376132 - Flags: review+
Jacob, could you do the arm part for this? I would like to land it. This could help ARM significantly.
This is a 10ms speedup on x86 (8% on crypto-aes), so I will go ahead and land this as soon ARM has caught up.
(In reply to comment #19) > Jacob, could you do the arm part for this? I would like to land it. This could > help ARM significantly. Yep, I'll get on it.
I don't think there is time or need for this for 3.5, so wanted 1.9.2
Flags: wanted1.9.1? → wanted1.9.2?
@22: This bug is important for our math-heavy application. If the fix is not in 3.5, we'll have to wait another release cycle to see included, and those can be very long indeed. The patch itself has been reviewed and tested, at least in x86; other platforms can continue using the generic implementation, which works fine. Given that the patch is here, works, and improves performance noticeably, can you please reconsider including it in 3.5?
I will set it back to wanted 1.9.1 and discuss it with the drivers. Since it didn't make the last beta, I am pretty worried about the extra risk. Do you have math heavy benchmarks that speed up significantly from this? In my internal testing I see about 1-2% on our benchmark set.
Flags: wanted1.9.2? → wanted1.9.1?
Never mind that --- After rerunning our geometry testcases against the latest code, the performance difference has disappeared (or at least slipped below easy detection.)
Great. Thats good to hear. I expect this patch to mostly benefit ARM and potentially sparc. For mul/div the x86 FPU is really fast, and so is the float to int conversion path if mul/div/mod is used in array indexes.
Important notes: * With or without my additions, the existing patch causes weird failures in trace-tests.js that don't seem to be directly related to divisions (though I haven't looked into this). Does the x86 version do the same? * I'm not entirely convinced that I've done things correctly or optimally. - The imod function could be emitted as simple instructions (by NativeARM.cpp), but it's not clear how I should get the result of a previous instruction (which is what I need to do). For example, I can easily do ins->oprnd1(), but there's no ins->result(). I'm probably missing something obvious. - I'm guessing that I can pass insCall a pointer to an instruction and it will use the result of the instruction. This seems to work. * I return -1 if the denominator is 0. Is this a good thing to do? * My changes should also work on x86 (for testing) as they aren't really ARM-specific. * My patch does not pass trace-tests.js, but I don't really know why.
Attached patch refreshed against tip (obsolete) — Splinter Review
Attachment #376132 - Attachment is obsolete: true
Attached patch patch (obsolete) — Splinter Review
Disable imod/idiv generation on non-i386 platforms since it doesn't seem to help due to lack of hardware division/modulo support (ARM in particular).
Attachment #380004 - Attachment is obsolete: true
Patch is cycling through the trial server. 20-25ms speedup. I will land this and then try out more aggressive optimizations, i.e. specialize charAt to integer and specialize array reads and writes.
Jesse, could I get some fuzzing for this bug? Special attention on arithmetic expressions involving * / %. Thanks.
Looks ok after 20 minutes with jsfunfuzz and comparison fuzzers.
Thanks. I get a failure on the unit test boxes. I will have to dig into it.
After a few more minutes I got: for each (let h in [[], [], [], [], 31]) for (let i = 0; i < 3; ++i) print(i / 10); Crash [@ nanojit::Assembler::nPatchBranch]
for (let i = 0; i < 5; ++i) { var a = i; print('' + (a %= a)); } Crash in JIT code? (Called by js_MonitorLoopEdge)
Awesome Jesse. You rock. I can reproduce the crash here. That will help a lot.
--------------------------------------- end exit block 0x278324 div1 = div ld1, ld1 mod1 = mod div1 jf eq3 -> label1 cdq eax(state) ebx(cx) esi(sp) edi(ld1) idiv edx:eax, edi eax(state) ebx(cx) esi(sp) edi(ld1) mov -12(ebp),edi eax(state) ebx(cx) esi(sp) edi(div1) mov edi,edx eax(state) edx(mod1) ebx(cx) esi(sp) mov edx,-12(ebp) eax(state) ebx(cx) esi(sp) test edi,edi eax(state) ebx(cx) esi(sp) edi(mod1) jne 0x279ef8 eax(state) ebx(cx) esi(sp) edi(mod1) We think state is in eax but idiv uses and clobbers it. Something is wrong there.
Attached patch interdiff (obsolete) — Splinter Review
This fixes the crash in #35 but produces an incorrect result.
Attachment #380365 - Attachment is patch: true
Attachment #380365 - Attachment mime type: application/octet-stream → text/plain
Attached patch patch (obsolete) — Splinter Review
This fixes the crashes but #35 still produces a bad result.
Attachment #376405 - Attachment is obsolete: true
Attachment #380007 - Attachment is obsolete: true
Attachment #380365 - Attachment is obsolete: true
This patch shuffles around the register selection code to make it less ... aehm ... crazy? Could use some more fuzzing I am sure.
Attachment #380382 - Attachment is obsolete: true
Attachment #380387 - Flags: review?(nnethercote)
Cycling through the try server again. Nick, could you review the patch. We had a couple review cycles but a lot of stuff changed and the backend changes are really complex.
Blocks: 495402
Attached patch fix non-ia32 (obsolete) — Splinter Review
Attachment #380387 - Attachment is obsolete: true
Attachment #380387 - Flags: review?(nnethercote)
var t = new Boolean(true); for each (var b in [false, 1.5, 1.5, false]) { --b; for each (var c in [t, 0.1, {}, {}, {}, t, t, 0.1]) { c %= 1; } } Assertion failure: !ti->typeMap.matches(ti_other->typeMap), at ../jstracer.cpp:3800
var f = 1; for (var a = 0; a < 5; ++a) { f = (f - 1) >>> 0; print(a % f); } With this patch, I get "NaN 1 2 0 0" instead of "NaN 1 2 3 4".
Attachment #380473 - Attachment is obsolete: true
Attachment #382884 - Flags: review?
Attachment #382884 - Flags: review? → review?(dvander)
Comment on attachment 382891 [details] [diff] [review] extract CASE_EXIT fix (was patched separately) and avoid CDQ and use MOV/SAR instead (supposed to be faster on core2) >+ CDQ(); Looks like MOV/SAR didn't make it in after all. Patch seems fine anyway.
Attachment #382891 - Flags: review+
Whiteboard: fixed-in-tracemonkey
Status: NEW → RESOLVED
Closed: 16 years ago
Resolution: --- → FIXED
Flags: wanted1.9.1? → wanted1.9.2+
You need to log in before you can comment on or make changes to this bug.

Attachment

General

Created:
Updated:
Size: