Closed Bug 600414 Opened 9 years ago Closed 9 years ago

TM: v8-crypto calls js_DoubleToInt32 too often because multiply returns doubles too often

Categories

(Core :: JavaScript Engine, enhancement)

enhancement
Not set

Tracking

()

RESOLVED FIXED

People

(Reporter: billm, Assigned: billm)

References

Details

(Whiteboard: fixed-in-tracemonkey)

Attachments

(1 file, 1 obsolete file)

This is part two of Bug 600016. The problem is that the tracer is treating some variables as doubles when integers would suffice. It happens when v8-crypto does integer multiplications that result in zero. To avoid problems with -0, the tracer guards against multiplies resulting in zero and switches to a double path.
Attached patch patch (obsolete) — Splinter Review
This patch fixes the problem. It uses a stronger test to ensure that we only go off trace if the result of the multiple is zero *and* if one of the operands is negative.

I tried a number of variations on the exact tests to do in different circumstances. This patch is the best I could come up with. The original patch I posted in Bug 600016 used a different set of tests. However, those tests included a branch that was poorly predicted in Kraken's imaging-gaussian-blur benchmark, causing the running time to double. This patch is slightly slower on v8-crypto than the previous one, but it fixes the Kraken problem (and even makes Gaussian blur faster than before, rather then 2x slower). There don't appear to be any other performance effects.
Assignee: general → wmccloskey
Status: NEW → ASSIGNED
Attachment #479246 - Flags: review?(nnethercote)
Just to clarify:  with the current code we see an integer-multiply that results in a positive zero, but we go off-trace, and then end up generating a new trace with a double-multiply?  Is that right?

It's depressing that a multiply causes up to 8 LIR instructions to be generated.  And TraceRecorder::alu() is really gross.  Sigh.

Cachegrind results (TM+JM):

---------------------------------------------------------------
| millions of instructions executed                           |
| total                        | on-trace (may overestimate)  |
---------------------------------------------------------------
|    91.030    91.599 (0.994x) |    46.930    46.939 (------) | 3d-cube
|    38.138    37.224 (1.025x) |    25.474    24.524 (1.039x) | 3d-morph
|    96.837    96.911 (0.999x) |    43.077    43.078 (------) | 3d-raytrace
|    33.090    33.086 (------) |    14.151    14.151 (------) | access-binary-
|   102.941   102.931 (------) |    92.707    92.707 (------) | access-fannkuc
|    29.011    29.025 (0.999x) |    17.177    17.177 (------) | access-nbody
|    39.449    39.446 (------) |    28.180    28.180 (------) | access-nsieve
|     7.414     7.412 (------) |     3.245     3.245 (------) | bitops-3bit-bi
|    36.796    36.794 (------) |    32.519    32.519 (------) | bitops-bits-in
|    15.837    15.836 (------) |    12.015    12.015 (------) | bitops-bitwise
|    43.268    43.266 (------) |    37.963    37.963 (------) | bitops-nsieve-
|    21.220    21.218 (------) |    17.045    17.045 (------) | controlflow-re
|    83.053    83.076 (------) |    30.374    30.367 (------) | crypto-aes
|    32.046    32.118 (0.998x) |     4.764     4.824 (0.988x) | crypto-md5
|    19.861    19.889 (0.999x) |     6.418     6.431 (0.998x) | crypto-sha1
|    68.999    69.629 (0.991x) |    21.855    21.873 (0.999x) | date-format-to
|    70.822    70.819 (------) |     9.708     9.708 (------) | date-format-xp
|    44.891    44.899 (------) |    31.065    31.065 (------) | math-cordic
|    20.296    20.460 (0.992x) |     6.166     6.294 (0.980x) | math-partial-s
|    22.025    22.895 (0.962x) |    13.407    14.255 (0.941x) | math-spectral-
|    49.547    49.546 (------) |    34.585    34.585 (------) | regexp-dna
|    30.875    30.874 (------) |     9.406     9.406 (------) | string-base64
|    86.826    86.994 (0.998x) |    24.725    24.893 (0.993x) | string-fasta
|   112.613   112.584 (------) |    17.240    17.240 (------) | string-tagclou
|   165.725   165.503 (1.001x) |    21.488    21.488 (------) | string-unpack-
|    43.922    43.916 (------) |     8.557     8.557 (------) | string-validat
-------
|  1406.546  1407.965 (0.999x) |   610.254   610.543 (------) | all

---------------------------------------------------------------
| millions of instructions executed                           |
| total                        | on-trace (may overestimate)  |
---------------------------------------------------------------
|  1797.820  1319.981 (1.362x) |  1661.044  1170.390 (1.419x) | v8-crypto
|  1616.046  1613.366 (1.002x) |  1096.839  1096.817 (------) | v8-deltablue
|  1110.784  1110.782 (------) |   509.578   509.578 (------) | v8-earley-boye
|  1142.564  1142.751 (------) |   300.160   300.167 (------) | v8-raytrace
|   906.213   906.220 (------) |   374.699   374.699 (------) | v8-regexp
|  1196.863  1195.437 (1.001x) |  1165.764  1164.185 (1.001x) | v8-richards
|   820.692   802.518 (1.023x) |   115.480   115.019 (1.004x) | v8-splay
-------
|  8590.985  8091.059 (1.062x) |  5223.566  4730.857 (1.104x) | all

---------------------------------------------------------------
| millions of instructions executed                           |
| total                        | on-trace (may overestimate)  |
---------------------------------------------------------------
|  3307.707  3307.661 (------) |  3023.027  3023.027 (------) | ai-astar
|  3091.629  3091.844 (------) |  1766.183  1766.194 (------) | audio-beat-det
|  1518.265  1486.840 (1.021x) |  1344.176  1312.740 (1.024x) | audio-dft
|  3038.195  3038.213 (------) |  1743.312  1743.357 (------) | audio-fft
|  2786.270  2785.558 (------) |  1855.903  1855.888 (------) | audio-oscillat
|  7792.134  7502.078 (1.039x) |  5629.416  5339.076 (1.054x) | imaging-gaussi
|  3431.130  3431.297 (------) |   887.526   887.528 (------) | imaging-darkro
|  6787.941  6787.950 (------) |  4734.673  4734.675 (------) | imaging-desatu
|   696.571   696.357 (------) |     9.976     9.976 (------) | json-parse-fin
|   500.265   499.767 (1.001x) |     6.266     6.266 (------) | json-stringify
|  1552.147  1533.904 (1.012x) |   737.392   736.336 (1.001x) | stanford-crypt
|   791.470   829.318 (0.954x) |   373.539   374.598 (0.997x) | stanford-crypt
|  1901.987  1901.880 (------) |  1170.428  1170.429 (------) | stanford-crypt
|   566.638   566.621 (------) |   233.826   233.827 (------) | stanford-crypt
-------
| 37762.355 37459.293 (1.008x) | 23515.649 23193.923 (1.014x) | all


(Kraken numbers include the loading of the data file.)
Nice numbers!
Attachment #479246 - Flags: review?(nnethercote) → review+
(In reply to comment #2)
> Just to clarify:  with the current code we see an integer-multiply that results
> in a positive zero, but we go off-trace, and then end up generating a new trace
> with a double-multiply?  Is that right?

Sortof. I think we never actually get to the point of generating the integer path because the multiply we see during trace recording results in positive zero. The switch statement at the top of alu() detects this and generates a double multiply. There may be others where we fall off trace; I only stepped through one of them.
Attached patch updated patchSplinter Review
Unfortunately, I had to make some changes to this patch. It turns out it was causing the Gaussian blur problem in Bug 580468. The code that got emitted there was sub-optimal.

To justify why I think this new patch is improved, here is a list of the approaches I've tried for doing the negative zero check.

1. First I tried the following code:
    if result == 0 {
        if either operand is negative, leave trace
    }
This works well as long as the result is rarely zero. However, if the result is sometimes zero and sometimes nonzero, then the branch becomes unpredictable and performance becomes terrible on modern CPUs.

2. Then I just tried the strongest possible test:
    if result == 0 and (operand1 negative or operand2 negative):
        leave trace
This branch is better because it's very predictable. However, there are a lot more instructions executed than before, so it's a little slower on most of the benchmarks.

3. Then I tried a sort of band-aid. I checked if the result was zero during recording. If that happened, then I just emitted a simpler check:
    if either operand is negative, leave trace
This worked well in testing, because all the benchmarks where zero results occurred also had the property that a zero occurred during recording. However, changing HOTLOOP messed this up, since different values are seen during recording. This is what happened to Gaussian blur.

4. This patch tries to be more robust. It uses a weak test initially (it only checks if either operand is negative). If this check ever fails, we go off trace with a new exitType, MUL_ZERO_EXIT. This causes a recompilation of the trace using the stronger check (checking if the result is zero and one of the operands is negative). The oracle is used to decide which check to emit.

Anyway, it seems to perform well in testing and I think it's even a little easier to understand. However, it does require the additional complexity of a new exitType and oracle.
Attachment #479246 - Attachment is obsolete: true
Attachment #479979 - Flags: review?(nnethercote)
Comment on attachment 479979 [details] [diff] [review]
updated patch

First of all, I appreciate your measurement thoroughness with this bug, and
the nice explanation of what you tried in comment 4.


>+      case MUL_ZERO_EXIT:
>       case OVERFLOW_EXIT:
>-        traceMonitor->oracle->markInstructionUndemotable(cx->regs->pc);
>+        if (lr->exitType == MUL_ZERO_EXIT)
>+            traceMonitor->oracle->markInstructionSlowZeroTest(cx->regs->pc);
>+        else
>+            traceMonitor->oracle->markInstructionUndemotable(cx->regs->pc);
>         /* FALL THROUGH */
>       case BRANCH_EXIT:
>       case CASE_EXIT:
>         /* Abort recording the outer tree, extend the inner tree. */
>         if (AbortRecording(cx, "Inner tree is trying to grow, "
>                                "abort outer recording") == JIT_RESET) {
>             return ARECORD_ABORTED;
>         }

Better than this complicated falling through would be to move the common
tail code to the end of a function, give it a label, and use gotos in the
cases.


>+
>+            /*
>+             * Make sure we don't lose a -0. We exit if the result is zero and if
>+             * either operand is negative. We start out using a weaker guard, checking
>+             * if either argument is negative. If this ever fails, we recompile with
>+             * a stronger, but slower, guard.
>+             */

I've reworked the comments for this code below... I think it's clearer
because (a) it's clearer which test is which;  (b) it doesn't use unclear
"weaker" and "strong" descriptions;  (c) it's clearer exactly what each test
does;  (d) it makes it clearer the conditions under which each test is used.
Feel free to improve it further if you can see how to, or correct any
mistakes I've made.  Sorry for being picky, I can see myself scratching my
head over this code in 6 months if the comment isn't super-clear.

              /*
               * Make sure we don't lose a -0.  If either operand is
               * negative at record-time, or the oracle tells us to, we use
               * a slower test that that won't exit unless absolutely
               * necessary.  Otherwise, we use a faster test that may exit
               * unnecessarily (in which case the MUL_ZERO_EXIT condition
               * will tell the oracle to use the slow test if we recompile).
               */
>+            if (v0 < 0.0 || v1 < 0.0
>+                || !oracle || oracle->isInstructionSlowZeroTest(cx->regs->pc))
>+            {
>+                if (!exit)
>+                    exit = snapshot(OVERFLOW_EXIT);
>+
                  /*
                   * Slower test, won't exit unnecessarily:
                   *   exit if (result==0 && (d0 < 0 || d1 < 0))
                   */
>+                guard(false,
>+                      lir->ins2(LIR_andi,
>+                                lir->insEqI_0(result),
>+                                lir->ins2(LIR_ori,
>+                                          lir->ins2ImmI(LIR_lti, d0, 0),
>+                                          lir->ins2ImmI(LIR_lti, d1, 0))),
>+                      exit);
>+            } else {
                  /*
                   * Faster test, may exit unnecessarily:
                   *   exit if (d0 < 0)
                   *   exit if (d1 < 0)
                   */
>+                guardNonNeg(d0, d1, snapshot(MUL_ZERO_EXIT));
>+            }

BTW, are you sure you want the "if (v0 < 0.0 || v1 < 0.0" part in there?  I
ask because your description of this patch in comment 4 doesn't mention that
part.


r=me with the above changes.  No need to re-review.
Attachment #479979 - Flags: review?(nnethercote) → review+
(In reply to comment #5)
Thanks very much. I like the comments

The |v0 < 0.0 || v1 < 0.0| part is there for two reasons. First, it seems silly to emit the NonNeg guard if it is violated during the recording run. Second, the NonNeg guard asserts that its operands are not negative constants. It does this since that would cause the guard to always exit, which would cause a NanoJIT assertion (or maybe it won't after bug 601009?). The |v0 < 0.0 || v1 < 0.0| check ensures that the NonNeg assertions always succeed.
(In reply to comment #2)
"And TraceRecorder::alu() is really gross.  Sigh."

Bug filed?
http://hg.mozilla.org/mozilla-central/rev/a1989f4fa64b
Status: ASSIGNED → RESOLVED
Closed: 9 years ago
Resolution: --- → FIXED
You need to log in before you can comment on or make changes to this bug.