Closed Bug 600414 Opened 15 years ago Closed 15 years ago

TM: v8-crypto calls js_DoubleToInt32 too often because multiply returns doubles too often

Categories

(Core :: JavaScript Engine, enhancement)

enhancement
Not set
normal

Tracking

()

RESOLVED FIXED

People

(Reporter: billm, Assigned: billm)

References

Details

(Whiteboard: fixed-in-tracemonkey)

Attachments

(1 file, 1 obsolete file)

This is part two of Bug 600016. The problem is that the tracer is treating some variables as doubles when integers would suffice. It happens when v8-crypto does integer multiplications that result in zero. To avoid problems with -0, the tracer guards against multiplies resulting in zero and switches to a double path.
Attached patch patch (obsolete) — Splinter Review
This patch fixes the problem. It uses a stronger test to ensure that we only go off trace if the result of the multiple is zero *and* if one of the operands is negative. I tried a number of variations on the exact tests to do in different circumstances. This patch is the best I could come up with. The original patch I posted in Bug 600016 used a different set of tests. However, those tests included a branch that was poorly predicted in Kraken's imaging-gaussian-blur benchmark, causing the running time to double. This patch is slightly slower on v8-crypto than the previous one, but it fixes the Kraken problem (and even makes Gaussian blur faster than before, rather then 2x slower). There don't appear to be any other performance effects.
Assignee: general → wmccloskey
Status: NEW → ASSIGNED
Attachment #479246 - Flags: review?(nnethercote)
Just to clarify: with the current code we see an integer-multiply that results in a positive zero, but we go off-trace, and then end up generating a new trace with a double-multiply? Is that right? It's depressing that a multiply causes up to 8 LIR instructions to be generated. And TraceRecorder::alu() is really gross. Sigh. Cachegrind results (TM+JM): --------------------------------------------------------------- | millions of instructions executed | | total | on-trace (may overestimate) | --------------------------------------------------------------- | 91.030 91.599 (0.994x) | 46.930 46.939 (------) | 3d-cube | 38.138 37.224 (1.025x) | 25.474 24.524 (1.039x) | 3d-morph | 96.837 96.911 (0.999x) | 43.077 43.078 (------) | 3d-raytrace | 33.090 33.086 (------) | 14.151 14.151 (------) | access-binary- | 102.941 102.931 (------) | 92.707 92.707 (------) | access-fannkuc | 29.011 29.025 (0.999x) | 17.177 17.177 (------) | access-nbody | 39.449 39.446 (------) | 28.180 28.180 (------) | access-nsieve | 7.414 7.412 (------) | 3.245 3.245 (------) | bitops-3bit-bi | 36.796 36.794 (------) | 32.519 32.519 (------) | bitops-bits-in | 15.837 15.836 (------) | 12.015 12.015 (------) | bitops-bitwise | 43.268 43.266 (------) | 37.963 37.963 (------) | bitops-nsieve- | 21.220 21.218 (------) | 17.045 17.045 (------) | controlflow-re | 83.053 83.076 (------) | 30.374 30.367 (------) | crypto-aes | 32.046 32.118 (0.998x) | 4.764 4.824 (0.988x) | crypto-md5 | 19.861 19.889 (0.999x) | 6.418 6.431 (0.998x) | crypto-sha1 | 68.999 69.629 (0.991x) | 21.855 21.873 (0.999x) | date-format-to | 70.822 70.819 (------) | 9.708 9.708 (------) | date-format-xp | 44.891 44.899 (------) | 31.065 31.065 (------) | math-cordic | 20.296 20.460 (0.992x) | 6.166 6.294 (0.980x) | math-partial-s | 22.025 22.895 (0.962x) | 13.407 14.255 (0.941x) | math-spectral- | 49.547 49.546 (------) | 34.585 34.585 (------) | regexp-dna | 30.875 30.874 (------) | 9.406 9.406 (------) | string-base64 | 86.826 86.994 (0.998x) | 24.725 24.893 (0.993x) | string-fasta | 112.613 112.584 (------) | 17.240 17.240 (------) | string-tagclou | 165.725 165.503 (1.001x) | 21.488 21.488 (------) | string-unpack- | 43.922 43.916 (------) | 8.557 8.557 (------) | string-validat ------- | 1406.546 1407.965 (0.999x) | 610.254 610.543 (------) | all --------------------------------------------------------------- | millions of instructions executed | | total | on-trace (may overestimate) | --------------------------------------------------------------- | 1797.820 1319.981 (1.362x) | 1661.044 1170.390 (1.419x) | v8-crypto | 1616.046 1613.366 (1.002x) | 1096.839 1096.817 (------) | v8-deltablue | 1110.784 1110.782 (------) | 509.578 509.578 (------) | v8-earley-boye | 1142.564 1142.751 (------) | 300.160 300.167 (------) | v8-raytrace | 906.213 906.220 (------) | 374.699 374.699 (------) | v8-regexp | 1196.863 1195.437 (1.001x) | 1165.764 1164.185 (1.001x) | v8-richards | 820.692 802.518 (1.023x) | 115.480 115.019 (1.004x) | v8-splay ------- | 8590.985 8091.059 (1.062x) | 5223.566 4730.857 (1.104x) | all --------------------------------------------------------------- | millions of instructions executed | | total | on-trace (may overestimate) | --------------------------------------------------------------- | 3307.707 3307.661 (------) | 3023.027 3023.027 (------) | ai-astar | 3091.629 3091.844 (------) | 1766.183 1766.194 (------) | audio-beat-det | 1518.265 1486.840 (1.021x) | 1344.176 1312.740 (1.024x) | audio-dft | 3038.195 3038.213 (------) | 1743.312 1743.357 (------) | audio-fft | 2786.270 2785.558 (------) | 1855.903 1855.888 (------) | audio-oscillat | 7792.134 7502.078 (1.039x) | 5629.416 5339.076 (1.054x) | imaging-gaussi | 3431.130 3431.297 (------) | 887.526 887.528 (------) | imaging-darkro | 6787.941 6787.950 (------) | 4734.673 4734.675 (------) | imaging-desatu | 696.571 696.357 (------) | 9.976 9.976 (------) | json-parse-fin | 500.265 499.767 (1.001x) | 6.266 6.266 (------) | json-stringify | 1552.147 1533.904 (1.012x) | 737.392 736.336 (1.001x) | stanford-crypt | 791.470 829.318 (0.954x) | 373.539 374.598 (0.997x) | stanford-crypt | 1901.987 1901.880 (------) | 1170.428 1170.429 (------) | stanford-crypt | 566.638 566.621 (------) | 233.826 233.827 (------) | stanford-crypt ------- | 37762.355 37459.293 (1.008x) | 23515.649 23193.923 (1.014x) | all (Kraken numbers include the loading of the data file.) Nice numbers!
Attachment #479246 - Flags: review?(nnethercote) → review+
(In reply to comment #2) > Just to clarify: with the current code we see an integer-multiply that results > in a positive zero, but we go off-trace, and then end up generating a new trace > with a double-multiply? Is that right? Sortof. I think we never actually get to the point of generating the integer path because the multiply we see during trace recording results in positive zero. The switch statement at the top of alu() detects this and generates a double multiply. There may be others where we fall off trace; I only stepped through one of them.
Attached patch updated patchSplinter Review
Unfortunately, I had to make some changes to this patch. It turns out it was causing the Gaussian blur problem in Bug 580468. The code that got emitted there was sub-optimal. To justify why I think this new patch is improved, here is a list of the approaches I've tried for doing the negative zero check. 1. First I tried the following code: if result == 0 { if either operand is negative, leave trace } This works well as long as the result is rarely zero. However, if the result is sometimes zero and sometimes nonzero, then the branch becomes unpredictable and performance becomes terrible on modern CPUs. 2. Then I just tried the strongest possible test: if result == 0 and (operand1 negative or operand2 negative): leave trace This branch is better because it's very predictable. However, there are a lot more instructions executed than before, so it's a little slower on most of the benchmarks. 3. Then I tried a sort of band-aid. I checked if the result was zero during recording. If that happened, then I just emitted a simpler check: if either operand is negative, leave trace This worked well in testing, because all the benchmarks where zero results occurred also had the property that a zero occurred during recording. However, changing HOTLOOP messed this up, since different values are seen during recording. This is what happened to Gaussian blur. 4. This patch tries to be more robust. It uses a weak test initially (it only checks if either operand is negative). If this check ever fails, we go off trace with a new exitType, MUL_ZERO_EXIT. This causes a recompilation of the trace using the stronger check (checking if the result is zero and one of the operands is negative). The oracle is used to decide which check to emit. Anyway, it seems to perform well in testing and I think it's even a little easier to understand. However, it does require the additional complexity of a new exitType and oracle.
Attachment #479246 - Attachment is obsolete: true
Attachment #479979 - Flags: review?(nnethercote)
Comment on attachment 479979 [details] [diff] [review] updated patch First of all, I appreciate your measurement thoroughness with this bug, and the nice explanation of what you tried in comment 4. >+ case MUL_ZERO_EXIT: > case OVERFLOW_EXIT: >- traceMonitor->oracle->markInstructionUndemotable(cx->regs->pc); >+ if (lr->exitType == MUL_ZERO_EXIT) >+ traceMonitor->oracle->markInstructionSlowZeroTest(cx->regs->pc); >+ else >+ traceMonitor->oracle->markInstructionUndemotable(cx->regs->pc); > /* FALL THROUGH */ > case BRANCH_EXIT: > case CASE_EXIT: > /* Abort recording the outer tree, extend the inner tree. */ > if (AbortRecording(cx, "Inner tree is trying to grow, " > "abort outer recording") == JIT_RESET) { > return ARECORD_ABORTED; > } Better than this complicated falling through would be to move the common tail code to the end of a function, give it a label, and use gotos in the cases. >+ >+ /* >+ * Make sure we don't lose a -0. We exit if the result is zero and if >+ * either operand is negative. We start out using a weaker guard, checking >+ * if either argument is negative. If this ever fails, we recompile with >+ * a stronger, but slower, guard. >+ */ I've reworked the comments for this code below... I think it's clearer because (a) it's clearer which test is which; (b) it doesn't use unclear "weaker" and "strong" descriptions; (c) it's clearer exactly what each test does; (d) it makes it clearer the conditions under which each test is used. Feel free to improve it further if you can see how to, or correct any mistakes I've made. Sorry for being picky, I can see myself scratching my head over this code in 6 months if the comment isn't super-clear. /* * Make sure we don't lose a -0. If either operand is * negative at record-time, or the oracle tells us to, we use * a slower test that that won't exit unless absolutely * necessary. Otherwise, we use a faster test that may exit * unnecessarily (in which case the MUL_ZERO_EXIT condition * will tell the oracle to use the slow test if we recompile). */ >+ if (v0 < 0.0 || v1 < 0.0 >+ || !oracle || oracle->isInstructionSlowZeroTest(cx->regs->pc)) >+ { >+ if (!exit) >+ exit = snapshot(OVERFLOW_EXIT); >+ /* * Slower test, won't exit unnecessarily: * exit if (result==0 && (d0 < 0 || d1 < 0)) */ >+ guard(false, >+ lir->ins2(LIR_andi, >+ lir->insEqI_0(result), >+ lir->ins2(LIR_ori, >+ lir->ins2ImmI(LIR_lti, d0, 0), >+ lir->ins2ImmI(LIR_lti, d1, 0))), >+ exit); >+ } else { /* * Faster test, may exit unnecessarily: * exit if (d0 < 0) * exit if (d1 < 0) */ >+ guardNonNeg(d0, d1, snapshot(MUL_ZERO_EXIT)); >+ } BTW, are you sure you want the "if (v0 < 0.0 || v1 < 0.0" part in there? I ask because your description of this patch in comment 4 doesn't mention that part. r=me with the above changes. No need to re-review.
Attachment #479979 - Flags: review?(nnethercote) → review+
(In reply to comment #5) Thanks very much. I like the comments The |v0 < 0.0 || v1 < 0.0| part is there for two reasons. First, it seems silly to emit the NonNeg guard if it is violated during the recording run. Second, the NonNeg guard asserts that its operands are not negative constants. It does this since that would cause the guard to always exit, which would cause a NanoJIT assertion (or maybe it won't after bug 601009?). The |v0 < 0.0 || v1 < 0.0| check ensures that the NonNeg assertions always succeed.
(In reply to comment #2) "And TraceRecorder::alu() is really gross. Sigh." Bug filed?
Status: ASSIGNED → RESOLVED
Closed: 15 years ago
Resolution: --- → FIXED
You need to log in before you can comment on or make changes to this bug.

Attachment

General

Created:
Updated:
Size: