Use CMOVcc instead of index masking

RESOLVED FIXED in Firefox 59

Status

()

defect
RESOLVED FIXED
a year ago
7 months ago

People

(Reporter: jandem, Assigned: jandem)

Tracking

(Blocks 1 bug)

unspecified
mozilla60
Points:
---
Dependency tree / graph

Firefox Tracking Flags

(firefox59 fixed, firefox60 fixed)

Details

Attachments

(2 attachments)

Assignee

Description

a year ago
Using CMOVcc to zero the index register if OOB has a number of benefits:

* Performance

On x64 CMOVcc is a bit faster than the index masking, but it's significantly faster on x86 (since we use the slower masking scheme there). On Kraken I get the following on 32-bit:

no mitigations:  832 ms
index masking:  1011 ms (21.5% slower)
CMOVcc:          911 ms  (9.5% slower)

Furthermore, this is the naive way - in many cases (masm.boundsCheck32ForLoad, wasm bounds checks maybe?) we can fold the CMP with the CMP we do for the bounds check and that will result in more speedups.

* simplicity: it's much easier to reason about.
Depends on: 1435249
Assignee

Updated

a year ago
Assignee: nobody → jdemooij
Status: NEW → ASSIGNED
Assignee

Updated

a year ago
Blocks: 1435266
Assignee

Comment 1

a year ago
Does anyone have an older machine, to benchmark a patch for this?

My MBP has an i7-4980HQ CPU but it would be great to verify these results on older CPUs. I'm also curious about AMD.
Flags: needinfo?(nicolas.b.pierron)
Flags: needinfo?(lhansen)
Assignee

Comment 2

a year ago
Newer architectures (Broadwell, Skylake) would be interesting too...
I have access to various ancient hardware, including older AMD and Intel chips (AMD FX-4100 and a couple of Macs that probably run Core-2 Duo, and some older i7 - at least).  Until Monday I also have my mom's consumer Lenovo laptop in the house for servicing :-P so I can even run something there...

My new Linux system is some Xeon Sandy Bridge thing, if that's interesting.
Flags: needinfo?(lhansen)
Assignee

Comment 4

a year ago
Great! I will work on a patch and do a Try push so we have builds we can compare.
Assignee

Comment 5

a year ago
Posted file Micro-benchmark
First, here's a silly micro-benchmark that runs in the browser (but it's easy to run it in the shell too, of course).

In the shell I got the following, best of 5 runs:

x64 no mitigations: 333 ms
x64 index masking:  323 ms
x64 CMOVcc:         316 ms

x86 no mitigations: 327 ms
x86 index masking:  420 ms
x86 CMOVcc:         314 ms

So:

(1) Yes, on this micro-benchmark we're slower without mitigations. Probably some weird cache effect but I'm not complaining.

(2) This also confirms CMOVcc is a small win on 64-bit on my machine, but on 32-bit the improvement is much bigger.
Assignee

Comment 6

a year ago
I have builds to compare index masking vs CMOVcc. Mitigations are enabled by default with these builds, so you don't need to change any prefs or use shell flags.

If anyone reading this could run Kraken [0] (and probably the micro-benchmark I attached) in a browser/shell build and report the results, that would be great. Ideally both 32-bit and 64-bit, but just one of them would help too.

Index masking: https://treeherder.mozilla.org/#/jobs?repo=try&revision=55f3f4c434f82e4f0c7df9f3404da2d93b073ac5

CMOVcc: https://treeherder.mozilla.org/#/jobs?repo=try&revision=b4efd393f6f3058858304268a6804134dd090bb0

[0] https://krakenbenchmark.mozilla.org/
Assignee

Comment 7

a year ago
Talos agrees this is an improvement. Linux64 Talos machines are Skylakes now (Xeon E3-1585Lv5) and it's a consistent 1-3% improvement there on Kraken compared to index masking. As expected, this is bigger on 32-bit: 8.83% on Win32.

So it looks like CMOV performs well on (newer) Intel CPUs. Still very interested in AMD results and older Intels.
AMD FX-6300 on Win7 (x64)

Kraken, three runs for each configuration (chronology order).
x64 58.0.1:        [1678.6ms +/- 4.8%, 1609.9ms +/- 1.4%, 1651.1ms +/- 2.5%]
x64 index masking: [1914.6ms +/- 5.6%, 1839.5ms +/- 1.3%, 1855.2ms +/- 1.5%]
x64 CMOVcc:        [1840.5ms +/- 4.0%, 1818.6ms +/- 2.1%, 1782.1ms +/- 3.8%]
x86 index masking: [2104.8ms +/- 2.8%, 2117.3ms +/- 2.6%, 2091.4ms +/- 3.3%]
x86 CMOVcc:        [1949.2ms +/- 2.9%, 1968.9ms +/- 2.3%, 1932.7ms +/- 3.8%]

Shell test case:
x64 tip@094c6dbe48a3: 214ms
x64 index masking:    370ms
x64 CMOVcc:           319ms
x86 index masking:    559ms
x86 CMOVcc:           318ms
Assignee

Comment 9

a year ago
(In reply to André Bargull [:anba] from comment #8)
> AMD FX-6300 on Win7 (x64)

Thanks a lot! I'm very happy to see CMOVcc is faster on AMD too - I was worried it would be slow.
I don't know if this is old enough to be interesting, but here are results for an Intel Core i7-3770K (Ivy Bridge, 2012) on Windows 10 x64.

Kraken, average result first, individual results in parentheses:
  x64 Nightly:       1145.5ms (1152.7ms 1144.3ms 1166.4ms 1135.1ms 1144.7ms 1129.8ms)
  x64 index masking: 1293.8ms (1314.9ms 1284.4ms 1294.6ms 1303.6ms 1280.4ms 1284.7ms)
  x64 CMOVcc:        1310.2ms (1293.2ms 1310.3ms 1332.2ms 1298.7ms 1315.1ms 1311.6ms)

  x86 Nightly:       1196.6ms (1201.9ms 1195.0ms 1186.8ms 1211.0ms 1189.0ms 1195.7ms)
  x86 index masking: 1428.3ms (1404.9ms 1424.1ms 1452.8ms 1421.8ms 1444.0ms 1422.3ms)
  x86 CMOVcc:        1359.7ms (1379.6ms 1366.7ms 1359.6ms 1324.1ms 1388.0ms 1340.0ms)

So on x64 it looks like a wash or a (very) slight regression, on x86 CMOVcc is a definite win.
Results for the microbenchmark (all builds compiled locally):

                median  min  mean  max  sd
x64 m-c tip:       304  300   305  321   3
x64 index masking: 316  300   318  359  14
x64 CMOVcc:        319  299   320  356  14
x86 m-c tip:       303  300   303  309   2
x86 index masking: 443  436   443  457   4
x86 CMOVcc:        315  299   318  349  13

Note that the difference between index masking and CMOVcc on x64 is not statistically significant after 200 runs.
Assignee

Comment 12

a year ago
Thanks! It's interesting the difference is smaller on that machine on x64. That makes me wonder about even older architectures like Core 2, maybe Lars can help us with that. But considering the other benefits, better perf on 32-bit and newer CPUs, we could take a perf regression there, as long as it's not too horrible. And we can still improve perf more by eliminating the second CMP in some cases.

Comment 13

a year ago
x64 nightly:

1035.7ms +/- 2.6%
1052.2ms +/- 2.6%
1089.0ms +/- 3.7%

Index Masking

x64:
1233.8ms +/- 4.0%
1182.7ms +/- 1.7%
1247.9ms +/- 3.5%

x86:
1503.0ms +/- 4.2% 
1411.1ms +/- 2.0%
1448.0ms +/- 5.9%

CMOVcc

x64:
1155.6ms +/- 1.7%
1168.4ms +/- 5.3%
1157.7ms +/- 2.2%

x86: 
1283.2ms +/- 3.1%
1315.2ms +/- 3.5%
1264.7ms +/- 2.1%
Intel(R) Core(TM)2 Duo CPU P8400 @ 2.26GHz 2.27GHz

Nightly x86: 3668.5ms +/- 5.3%
https://krakenbenchmark.mozilla.org/kraken-1.1/results.html?%7B%22v%22:%20%22kraken-1.1%22,%20%22ai-astar%22:%5B175,180,193,175,184,200,174,176,192,197%5D,%22audio-beat-detection%22:%5B427,313,326,325,323,313,314,311,326,320%5D,%22audio-dft%22:%5B430,421,417,386,589,432,397,444,428,422%5D,%22audio-fft%22:%5B250,221,226,215,220,213,223,214,259,225%5D,%22audio-oscillator%22:%5B187,369,169,177,175,175,187,261,293,171%5D,%22imaging-gaussian-blur%22:%5B277,266,271,270,279,288,270,276,267,278%5D,%22imaging-darkroom%22:%5B358,348,372,346,358,340,347,340,351,351%5D,%22imaging-desaturate%22:%5B311,276,288,328,275,255,258,275,453,275%5D,%22json-parse-financial%22:%5B137,144,133,128,137,168,130,133,188,142%5D,%22json-stringify-tinderbox%22:%5B83,82,89,92,97,92,93,285,106,88%5D,%22stanford-crypto-aes%22:%5B167,167,177,175,196,172,164,175,167,171%5D,%22stanford-crypto-ccm%22:%5B286,272,242,268,306,267,257,309,413,252%5D,%22stanford-crypto-pbkdf2%22:%5B494,508,443,415,428,418,441,497,693,419%5D,%22stanford-crypto-sha256-iterative%22:%5B164,164,152,158,151,154,161,154,180,154%5D%7D


Index masking: 4550.1ms +/- 2.4%
https://krakenbenchmark.mozilla.org/kraken-1.1/results.html?%7B%22v%22:%20%22kraken-1.1%22,%20%22ai-astar%22:%5B274,272,271,300,286,269,288,259,419,265%5D,%22audio-beat-detection%22:%5B469,432,362,351,362,387,438,366,451,352%5D,%22audio-dft%22:%5B565,597,616,651,622,543,593,573,582,599%5D,%22audio-fft%22:%5B242,347,307,319,261,280,315,284,248,247%5D,%22audio-oscillator%22:%5B226,433,288,339,297,549,226,293,222,250%5D,%22imaging-gaussian-blur%22:%5B472,503,444,440,482,449,446,437,447,450%5D,%22imaging-darkroom%22:%5B387,386,376,365,374,370,400,373,364,360%5D,%22imaging-desaturate%22:%5B398,348,338,358,392,323,360,361,376,335%5D,%22json-parse-financial%22:%5B200,141,136,138,173,154,151,138,167,152%5D,%22json-stringify-tinderbox%22:%5B119,105,138,139,115,104,103,113,109,125%5D,%22stanford-crypto-aes%22:%5B219,222,204,243,201,239,207,251,213,224%5D,%22stanford-crypto-ccm%22:%5B277,289,268,280,296,362,329,273,369,287%5D,%22stanford-crypto-pbkdf2%22:%5B445,516,480,502,470,446,454,532,440,444%5D,%22stanford-crypto-sha256-iterative%22:%5B187,166,168,187,183,200,440,192,169,206%5D%7D


CMOVcc: 4254.8ms +/- 5.6%
https://krakenbenchmark.mozilla.org/kraken-1.1/results.html?%7B%22v%22:%20%22kraken-1.1%22,%20%22ai-astar%22:%5B247,193,202,194,192,212,220,196,187,187%5D,%22audio-beat-detection%22:%5B432,607,384,384,342,343,425,345,360,345%5D,%22audio-dft%22:%5B510,656,538,538,479,494,621,482,515,486%5D,%22audio-fft%22:%5B264,359,235,314,286,252,276,245,234,238%5D,%22audio-oscillator%22:%5B184,389,184,358,269,232,192,188,209,229%5D,%22imaging-gaussian-blur%22:%5B360,416,366,367,375,367,365,377,382,358%5D,%22imaging-darkroom%22:%5B380,373,372,372,398,366,382,376,394,381%5D,%22imaging-desaturate%22:%5B326,389,308,285,355,286,315,282,308,310%5D,%22json-parse-financial%22:%5B190,149,162,134,231,187,157,143,412,132%5D,%22json-stringify-tinderbox%22:%5B143,122,111,119,115,113,124,110,109,106%5D,%22stanford-crypto-aes%22:%5B249,196,196,202,279,193,209,200,202,197%5D,%22stanford-crypto-ccm%22:%5B535,270,292,268,469,275,275,279,269,271%5D,%22stanford-crypto-pbkdf2%22:%5B664,462,432,523,435,448,520,488,476,430%5D,%22stanford-crypto-sha256-iterative%22:%5B287,260,198,182,168,259,191,175,179,232%5D%7D
Intel(R) Core(TM)2 Duo CPU P8400 @ 2.26GHz 2.27GHz

micro-benchmark jsshell x86

no mitigations: 491
index masking: 1279
CMOVcc: 850
Assignee

Comment 16

a year ago
Thanks a lot people, very helpful.

Ekanan, just curious, is that 32-bit Windows? If 64-bit, it would be very interesting to have data for Firefox 64-bit on Core 2 :)
It's windows 10 32-bit :)

Comment 18

a year ago
I forgot to mention my build specs in #13

Win10 Pro x64 
i7 6700HQ
Huh, turns out both my Core 2 Duo systems run Mac OS X 32-bit, not really a current or interesting platform...  Anyway one of them is a model with the CPU designation T7600 @ 2.33GHz (MacBook Pro ca 2006).
Assignee

Comment 20

a year ago
Posted patch PatchSplinter Review
It turns out that this version:

  CMP index, length
  CMOVB index, output

is a lot faster than this one:

  CMP length, index
  CMOVA index, output

I don't understand why, but it repros on multiple benchmarks (index and length are both in registers and the code size is exactly the same).

Anyway, I replaced the first version with the second, then had to revert it because the first is faster. Oh well.
Attachment #8948819 - Flags: review?(luke)
Comment on attachment 8948819 [details] [diff] [review]
Patch

Review of attachment 8948819 [details] [diff] [review]:
-----------------------------------------------------------------

Faster, shorter, simpler; pick any three.
Attachment #8948819 - Flags: review?(luke) → review+

Comment 22

a year ago
Pushed by jandemooij@gmail.com:
https://hg.mozilla.org/integration/mozilla-inbound/rev/0bc556c6e060
Use CMOVcc instead of index masking. r=luke

Comment 23

a year ago
bugherder
https://hg.mozilla.org/mozilla-central/rev/0bc556c6e060
Status: ASSIGNED → RESOLVED
Last Resolved: a year ago
Resolution: --- → FIXED
Target Milestone: --- → mozilla60
Assignee

Comment 24

a year ago
Comment on attachment 8948819 [details] [diff] [review]
Patch

See bug 1431173 comment 6.
Attachment #8948819 - Flags: approval-mozilla-beta?
Comment on attachment 8948819 [details] [diff] [review]
Patch

Spectre-related fix. Taking for 59b8.
Attachment #8948819 - Flags: approval-mozilla-beta? → approval-mozilla-beta+

Comment 27

a year ago
Intel(R) Core(TM) i5-2430M CPU @ 2.40GHz
Ubuntu 16.04

Micro benchmark:

CMOV:          511
Index Masking: 522

==================== Kraken ===================

--------------------- CMOV --------------------

-----------------------------------------------
Total:                       2368.4ms +/- 12.7%
-----------------------------------------------

  ai:                         132.7ms +/- 17.0%
    astar:                    132.7ms +/- 17.0%

  audio:                      757.9ms +/- 9.9%
    beat-detection:           171.8ms +/- 15.8%
    dft:                      354.9ms +/- 9.9%
    fft:                      109.3ms +/- 16.0%
    oscillator:               121.9ms +/- 19.2%

  imaging:                    610.1ms +/- 17.7%
    gaussian-blur:            197.8ms +/- 21.3%
    darkroom:                 222.3ms +/- 18.6%
    desaturate:               190.0ms +/- 13.3%

  json:                       146.7ms +/- 26.8%
    parse-financial:           88.8ms +/- 36.5%
    stringify-tinderbox:       57.9ms +/- 21.6%

  stanford:                   721.0ms +/- 17.6%
    crypto-aes:               134.2ms +/- 17.8%
    crypto-ccm:               221.5ms +/- 20.8%
    crypto-pbkdf2:            256.1ms +/- 20.5%
    crypto-sha256-iterative:  109.2ms +/- 30.7%


-------------------- Index Masking ------------

-----------------------------------------------
Total:                       2574.1ms +/- 11.7%
-----------------------------------------------

  ai:                         139.9ms +/- 16.6%
    astar:                    139.9ms +/- 16.6%

  audio:                      887.8ms +/- 12.9%
    beat-detection:           190.9ms +/- 16.3%
    dft:                      422.5ms +/- 13.2%
    fft:                      140.6ms +/- 28.9%
    oscillator:               133.8ms +/- 17.0%

  imaging:                    626.9ms +/- 18.0%
    gaussian-blur:            210.6ms +/- 22.9%
    darkroom:                 229.7ms +/- 18.0%
    desaturate:               186.6ms +/- 19.1%

  json:                       146.1ms +/- 23.5%
    parse-financial:           75.3ms +/- 21.6%
    stringify-tinderbox:       70.8ms +/- 39.8%

  stanford:                   773.4ms +/- 20.3%
    crypto-aes:               144.6ms +/- 19.0%
    crypto-ccm:               271.0ms +/- 39.1%
    crypto-pbkdf2:            262.2ms +/- 17.9%
    crypto-sha256-iterative:   95.6ms +/- 20.6%


============================== Octane ==========================

-------------------------------- CMOV ---------------------------


Octane 2.0
Octane Score: 17463
Richards
19691 Core language features
Deltablue

23683
Core language features
Crypto

16758
Bit & Math operations
Raytrace

57218
Core language features
EarleyBoyer

15394
Memory & GC
Regexp

2898
Strings & arrays
Splay

13470
Memory & GC
SplayLatency

13934
GC latency
NavierStokes

26284
Strings & arrays
pdf.js

11430
Strings & arrays
Mandreel

12715
Virtual machine
MandreelLatency

11785
Compiler latency
GB Emulator

33341
Virtual machine
CodeLoad

14284
Loading & Parsing
Box2DWeb

21731
Bit & Math operations
zlib

53209
asm.js
Typescript

14073
Virtual machine & GC

------------------------------ Index Masking -------------------

Octane 2.0
Octane Score: 16929
Richards
20606 Core language features
Deltablue

21323
Core language features
Crypto

16531
Bit & Math operations
Raytrace

53057
Core language features
EarleyBoyer

15761
Memory & GC
Regexp

2286
Strings & arrays
Splay

13438
Memory & GC
SplayLatency

15719
GC latency
NavierStokes

26389
Strings & arrays
pdf.js

7344
Strings & arrays
Mandreel

14224
Virtual machine
MandreelLatency

17250
Compiler latency
GB Emulator

37916
Virtual machine
CodeLoad

13920
Loading & Parsing
Box2DWeb

18199
Bit & Math operations
zlib

35333
asm.js
Typescript

16272
Virtual machine & GC

Comment 28

a year ago
Ooops! Please, ignore my comment #27.

The correct results are:

Intel(R) Core(TM) i5-2430M CPU @ 2.40GHz
Ubuntu 16.04

MICRO BENCHMARK
---------------
CMOV:          520
INDEX MASKING: 559

KRAKEN
-------
CMOV:          2394.7ms +/- 16.6%
INDEX MASKING: 2609.1ms +/- 10.4%


CMOV - KRAKEN - details
================================

Kraken JavaScript Benchmark Results
Content Version: kraken-1.1

Run Again


(You can bookmark this results URL for later comparison.)
To compare to another run, paste a saved result URL in the text field below and press enter:



===============================================
RESULTS (means and 95% confidence intervals)
-----------------------------------------------
Total:                       2394.7ms +/- 16.6%
-----------------------------------------------

  ai:                         126.9ms +/- 16.3%
    astar:                    126.9ms +/- 16.3%

  audio:                      896.9ms +/- 25.1%
    beat-detection:           174.1ms +/- 18.5%
    dft:                      457.1ms +/- 39.5%
    fft:                      134.2ms +/- 30.5%
    oscillator:               131.5ms +/- 19.9%

  imaging:                    592.7ms +/- 20.3%
    gaussian-blur:            190.1ms +/- 21.9%
    darkroom:                 219.9ms +/- 21.7%
    desaturate:               182.7ms +/- 18.1%

  json:                       127.8ms +/- 17.9%
    parse-financial:           77.2ms +/- 20.1%
    stringify-tinderbox:       50.6ms +/- 19.2%

  stanford:                   650.4ms +/- 11.4%
    crypto-aes:               123.2ms +/- 13.3%
    crypto-ccm:               211.5ms +/- 34.2%
    crypto-pbkdf2:            220.1ms +/- 7.8%
    crypto-sha256-iterative:   95.6ms +/- 15.1%

INDEX MASKING - KRAKEN - details
=================================


Kraken JavaScript Benchmark Results
Content Version: kraken-1.1

Run Again


(You can bookmark this results URL for later comparison.)
To compare to another run, paste a saved result URL in the text field below and press enter:



===============================================
RESULTS (means and 95% confidence intervals)
-----------------------------------------------
Total:                        2609.1ms +/- 10.4%
-----------------------------------------------

  ai:                          160.9ms +/- 16.1%
    astar:                     160.9ms +/- 16.1%

  audio:                      1028.7ms +/- 24.7%
    beat-detection:            250.7ms +/- 31.2%
    dft:                       518.2ms +/- 32.4%
    fft:                       133.9ms +/- 35.8%
    oscillator:                125.9ms +/- 20.5%

  imaging:                     568.9ms +/- 12.9%
    gaussian-blur:             195.2ms +/- 16.0%
    darkroom:                  205.7ms +/- 15.6%
    desaturate:                168.0ms +/- 11.8%

  json:                        138.3ms +/- 23.4%
    parse-financial:            82.5ms +/- 29.0%
    stringify-tinderbox:        55.8ms +/- 37.5%

  stanford:                    712.3ms +/- 15.6%
    crypto-aes:                156.6ms +/- 44.3%
    crypto-ccm:                192.8ms +/- 22.6%
    crypto-pbkdf2:             249.5ms +/- 15.3%
    crypto-sha256-iterative:   113.4ms +/- 22.4%
Assignee

Comment 29

a year ago
(In reply to tomaspartl from comment #28)
> The correct results are:

Thank you for these results!
Flags: needinfo?(nicolas.b.pierron)
You need to log in before you can comment on or make changes to this bug.